Preface


Taken  literally,  the  title  “All  of  Statistics”  is  an  exaggeration.  But  in  spirit, 
the  title  is  apt,  as  the  book  does  cover  a  much  broader  range  of  topics  than  a 
typical  introductory  book  on  mathematical  statistics. 

This  book  is  for  people  who  want  to  learn  probability  and  statistics  quickly. 
It  is  suitable  for  graduate  or  advanced  undergraduate  students  in  computer 
science,  mathematics,  statistics,  and  related  disciplines.  The  book  includes 
modern topics like nonparametric curve estimation, bootstrapping, and classification,
topics that are usually relegated to follow-up courses. The reader is
presumed  to  know  calculus  and  a  little  linear  algebra.  No  previous  knowledge 
of  probability  and  statistics  is  required. 

Statistics,  data  mining,  and  machine  learning  are  all  concerned  with 
collecting and analyzing data. For some time, statistics research was conducted
in statistics departments while data mining and machine learning research
was conducted in computer science departments. Statisticians thought
that  computer  scientists  were  reinventing  the  wheel.  Computer  scientists 
thought  that  statistical  theory  didn’t  apply  to  their  problems. 

Things  are  changing.  Statisticians  now  recognize  that  computer  scientists 
are  making  novel  contributions  while  computer  scientists  now  recognize  the 
generality of statistical theory and methodology. Clever data mining algorithms
are more scalable than statisticians ever thought possible. Formal statistical
theory is more pervasive than computer scientists had realized.

Students  who  analyze  data,  or  who  aspire  to  develop  new  methods  for 
analyzing  data,  should  be  well  grounded  in  basic  probability  and  mathematical 
statistics. Using fancy tools like neural nets, boosting, and support vector
machines without understanding basic statistics is like doing brain surgery
before knowing how to use a band-aid.

But  where  can  students  learn  basic  probability  and  statistics  quickly?  Nowhere. 
At  least,  that  was  my  conclusion  when  my  computer  science  colleagues  kept 
asking  me:  “Where  can  I  send  my  students  to  get  a  good  understanding  of 
modern  statistics  quickly?”  The  typical  mathematical  statistics  course  spends 
too much time on tedious and uninspiring topics (counting methods, two-dimensional
integrals, etc.) at the expense of covering modern concepts (bootstrapping,
curve estimation, graphical models, etc.). So I set out to redesign
our  undergraduate  honors  course  on  probability  and  mathematical  statistics. 
This  book  arose  from  that  course.  Here  is  a  summary  of  the  main  features  of 
this  book. 

1.  The  book  is  suitable  for  graduate  students  in  computer  science  and 
honors  undergraduates  in  math,  statistics,  and  computer  science.  It  is 
also  useful  for  students  beginning  graduate  work  in  statistics  who  need 
to  fill  in  their  background  on  mathematical  statistics. 

2. I cover advanced topics that are traditionally not taught in a first course,
for example, nonparametric regression, bootstrapping, density estimation,
and graphical models.

3. I have omitted topics in probability that do not play a central role in
statistical inference. For example, counting methods are virtually absent.

4.  Whenever  possible,  I  avoid  tedious  calculations  in  favor  of  emphasizing 
concepts. 

5.  I  cover  nonparametric  inference  before  parametric  inference. 

6.  I  abandon  the  usual  “First  Term  =  Probability”  and  “Second  Term 
=  Statistics”  approach.  Some  students  only  take  the  first  half  and  it 
would  be  a  crime  if  they  did  not  see  any  statistical  theory.  Furthermore, 
probability  is  more  engaging  when  students  can  see  it  put  to  work  in  the 
context  of  statistics.  An  exception  is  the  topic  of  stochastic  processes 
which  is  included  in  the  later  material. 

7.  The  course  moves  very  quickly  and  covers  much  material.  My  colleagues 
joke  that  I  cover  all  of  statistics  in  this  course  and  hence  the  title.  The 
course is demanding but I have worked hard to make the material as
intuitive as possible so that it remains understandable despite
the fast pace.

8.  Rigor  and  clarity  are  not  synonymous.  I  have  tried  to  strike  a  good 
balance.  To  avoid  getting  bogged  down  in  uninteresting  technical  details, 
many  results  are  stated  without  proof.  The  bibliographic  references  at 
the end of each chapter point the student to appropriate sources.

[FIGURE 1. Probability and inference. Figure labels: Probability; Inference and Data Mining.]


9.  On  my  website  are  files  with  R  code  which  students  can  use  for  doing 
all  the  computing.  The  website  is: 

http://www.stat.cmu.edu/~larry/all-of-statistics

However,  the  book  is  not  tied  to  R  and  any  computing  language  can  be 
used. 

Part  I  of  the  text  is  concerned  with  probability  theory,  the  formal  language 
of  uncertainty  which  is  the  basis  of  statistical  inference.  The  basic  problem 
that  we  study  in  probability  is: 

Given a data generating process, what are the properties of the outcomes?

Part  II  is  about  statistical  inference  and  its  close  cousins,  data  mining  and 
machine  learning.  The  basic  problem  of  statistical  inference  is  the  inverse  of 
probability: 

Given the outcomes, what can we say about the process that generated
the data?

These  ideas  are  illustrated  in  Figure  1.  Prediction,  classification,  clustering, 
and  estimation  are  all  special  cases  of  statistical  inference.  Data  analysis, 
machine  learning  and  data  mining  are  various  names  given  to  the  practice  of 
statistical inference, depending on the context.

Part III applies the ideas from Part II to specific problems such as regression,
graphical models, causation, density estimation, smoothing, classification,
and simulation. Part III contains one more chapter on probability that
covers stochastic processes including Markov chains.

I  have  drawn  on  other  books  in  many  places.  Most  chapters  contain  a  section 
called  Bibliographic  Remarks  which  serves  both  to  acknowledge  my  debt  to 
other  authors  and  to  point  readers  to  other  useful  references.  I  would  especially 
like  to  mention  the  books  by  DeGroot  and  Schervish  (2002)  and  Grimmett 
and  Stirzaker  (1982)  from  which  I  adapted  many  examples  and  exercises. 

As one develops a book over several years it is easy to lose track of where presentation
ideas and, especially, homework problems originated. Some I made
up.  Some  I  remembered  from  my  education.  Some  I  borrowed  from  other 
books.  I  hope  I  do  not  offend  anyone  if  I  have  used  a  problem  from  their  book 
and  failed  to  give  proper  credit.  As  my  colleague  Mark  Schervish  wrote  in  his 
book  (Schervish  (1995)), 

“. . .  the  problems  at  the  ends  of  each  chapter  have  come  from  many 
sources.  . . .  These  problems,  in  turn,  came  from  various  sources 
unknown  to  me  ...  If  I  have  used  a  problem  without  giving  proper 
credit,  please  take  it  as  a  compliment.” 

I am indebted to many people without whose help I could not have written
this book. First and foremost, I thank the many students who used earlier versions
of this text and provided much feedback. In particular, Liz Prather and Jennifer
Bakal read the book carefully. Rob Reeder valiantly read through the
entire book in excruciating detail and gave me countless suggestions for improvements.
Chris Genovese deserves special mention. He not only provided
helpful ideas about intellectual content, but also spent many, many hours
writing LaTeX code for the book. The best aspects of the book’s layout are due
to his hard work; any stylistic deficiencies are due to my lack of expertise.
David Hand, Sam Roweis, and David Scott read the book very carefully and
made numerous suggestions that greatly improved the book. John Lafferty
and Peter Spirtes also provided helpful feedback. John Kimmel has been supportive
and helpful throughout the writing process. Finally, my wife Isabella
Verdinelli has been an invaluable source of love, support, and inspiration.

Larry  Wasserman 
Pittsburgh,  Pennsylvania 

July 2003

Statistics/Data Mining Dictionary

Statisticians  and  computer  scientists  often  use  different  language  for  the 
same  thing.  Here  is  a  dictionary  that  the  reader  may  want  to  return  to 
throughout  the  course. 


Statistics               Computer Science       Meaning

estimation               learning               using data to estimate an unknown quantity
classification           supervised learning    predicting a discrete $Y$ from $X$
clustering               unsupervised learning  putting data into groups
data                     training sample        $(X_1, Y_1), \ldots, (X_n, Y_n)$
covariates               features               the $X_i$'s
classifier               hypothesis             a map from covariates to outcomes
hypothesis               -                      subset of a parameter space $\Theta$
confidence interval      -                      interval that contains an unknown quantity with given frequency
directed acyclic graph   Bayes net              multivariate distribution with given conditional independence relations
Bayesian inference       Bayesian inference     statistical methods for using data to update beliefs
frequentist inference    -                      statistical methods with guaranteed frequency behavior
large deviation bounds   PAC learning           uniform bounds on probability of errors

Probability


1 Probability


1.1 Introduction

Probability is a mathematical language for quantifying uncertainty. In this
chapter we introduce the basic concepts underlying probability theory. We
begin with the sample space, which is the set of possible outcomes.


1.2  Sample  Spaces  and  Events 

The sample space $\Omega$ is the set of possible outcomes of an experiment. Points
$\omega$ in $\Omega$ are called sample outcomes, realizations, or elements. Subsets of
$\Omega$ are called events.

1.1 Example. If we toss a coin twice then $\Omega = \{HH, HT, TH, TT\}$. The event
that the first toss is heads is $A = \{HH, HT\}$. ■

1.2 Example. Let $\omega$ be the outcome of a measurement of some physical quantity,
for example, temperature. Then $\Omega = \mathbb{R} = (-\infty, \infty)$. One could argue that
taking $\Omega = \mathbb{R}$ is not accurate since temperature has a lower bound. But there
is usually no harm in taking the sample space to be larger than needed. The
event that the measurement is larger than 10 but less than or equal to 23 is
$A = (10, 23]$. ■

1.3 Example. If we toss a coin forever, then the sample space is the infinite
set

    $$\Omega = \{\omega = (\omega_1, \omega_2, \omega_3, \ldots) : \omega_i \in \{H, T\}\}.$$

Let $E$ be the event that the first head appears on the third toss. Then

    $$E = \{(\omega_1, \omega_2, \omega_3, \ldots) : \omega_1 = T,\ \omega_2 = T,\ \omega_3 = H,\ \omega_i \in \{H, T\} \text{ for } i > 3\}.$$ ■

Given an event $A$, let $A^c = \{\omega \in \Omega : \omega \notin A\}$ denote the complement of
$A$. Informally, $A^c$ can be read as “not $A$.” The complement of $\Omega$ is the empty
set $\emptyset$. The union of events $A$ and $B$ is defined as

    $$A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B \text{ or } \omega \in \text{both}\}$$

which can be thought of as “$A$ or $B$.” If $A_1, A_2, \ldots$ is a sequence of sets then

    $$\bigcup_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for at least one } i\}.$$

The intersection of $A$ and $B$ is

    $$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}$$

read “$A$ and $B$.” Sometimes we write $A \cap B$ as $AB$ or $(A, B)$. If $A_1, A_2, \ldots$ is
a sequence of sets then

    $$\bigcap_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for all } i\}.$$

The set difference is defined by $A - B = \{\omega : \omega \in A, \omega \notin B\}$. If every element
of $A$ is also contained in $B$ we write $A \subset B$ or, equivalently, $B \supset A$. If $A$ is a
finite set, let $|A|$ denote the number of elements in $A$. See the following table
for a summary.


Summary of Terminology

$\Omega$                  sample space
$\omega$                  outcome (point or element)
$A$                       event (subset of $\Omega$)
$A^c$                     complement of $A$ (not $A$)
$A \cup B$                union ($A$ or $B$)
$A \cap B$ or $AB$        intersection ($A$ and $B$)
$A - B$                   set difference ($\omega$ in $A$ but not in $B$)
$A \subset B$             set inclusion
$\emptyset$               null event (always false)
$\Omega$                  true event (always true)
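
The set operations in the table map directly onto finite sets in code. The following is a minimal sketch, assuming the two-coin-toss sample space of Example 1.1; the variable names and the second event $B$ (heads on the second toss) are illustrative choices, not part of the text.

```python
from itertools import product

# Sample space for two coin tosses (Example 1.1): all ordered pairs of H/T.
omega = {"".join(t) for t in product("HT", repeat=2)}

A = {w for w in omega if w[0] == "H"}   # first toss is heads
B = {w for w in omega if w[1] == "H"}   # second toss is heads (illustrative)

complement_A = omega - A        # A^c
union = A | B                   # A union B
intersection = A & B            # A intersect B, also written AB
difference = A - B              # A - B
is_subset = intersection <= A   # AB is a subset of A

print(sorted(omega))         # ['HH', 'HT', 'TH', 'TT']
print(sorted(complement_A))  # ['TH', 'TT']
print(sorted(union))         # ['HH', 'HT', 'TH']
print(sorted(intersection))  # ['HH']
print(sorted(difference))    # ['HT']
print(is_subset)             # True
```

Representing events as sets of outcomes keeps the code close to the notation: every event is literally a subset of `omega`.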




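We say that $A_1, A_2, \ldots$ are disjoint or are mutually exclusive if $A_i \cap A_j = \emptyset$
whenever $i \neq j$. For example, $A_1 = [0, 1), A_2 = [1, 2), A_3 = [2, 3), \ldots$ are
disjoint. A partition of $\Omega$ is a sequence of disjoint sets $A_1, A_2, \ldots$ such that
$\bigcup_{i=1}^{\infty} A_i = \Omega$. Given an event $A$, define the indicator function of $A$ by

    $$I_A(\omega) = I(\omega \in A) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A. \end{cases}$$

A sequence of sets $A_1, A_2, \ldots$ is monotone increasing if $A_1 \subset A_2 \subset \cdots$
and we define $\lim_{n \to \infty} A_n = \bigcup_{i=1}^{\infty} A_i$. A sequence of sets $A_1, A_2, \ldots$ is
monotone decreasing if $A_1 \supset A_2 \supset \cdots$ and then we define
$\lim_{n \to \infty} A_n = \bigcap_{i=1}^{\infty} A_i$. In either case, we will write $A_n \to A$.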

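1.4 Example. Let $\Omega = \mathbb{R}$ and let $A_i = [0, 1/i)$ for $i = 1, 2, \ldots$. Then
$\bigcup_{i=1}^{\infty} A_i = [0, 1)$ and $\bigcap_{i=1}^{\infty} A_i = \{0\}$. If instead we define $A_i = (0, 1/i)$ then
$\bigcup_{i=1}^{\infty} A_i = (0, 1)$ and $\bigcap_{i=1}^{\infty} A_i = \emptyset$. ■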


1.3  Probability 

We will assign a real number $P(A)$ to every event $A$, called the probability of
$A$.¹ We also call $P$ a probability distribution or a probability measure.
To qualify as a probability, $P$ must satisfy three axioms:


1.5 Definition. A function $P$ that assigns a real number $P(A)$ to each
event $A$ is a probability distribution or a probability measure if it
satisfies the following three axioms:

Axiom 1: $P(A) \geq 0$ for every $A$
Axiom 2: $P(\Omega) = 1$
Axiom 3: If $A_1, A_2, \ldots$ are disjoint then

    $$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$


¹ It is not always possible to assign a probability to every event $A$ if the sample space is large,
such as the whole real line. Instead, we assign probabilities to a limited class of sets called a
$\sigma$-field. See the appendix for details.

There are many interpretations of $P(A)$. The two common interpretations
are  frequencies  and  degrees  of  beliefs.  In  the  frequency  interpretation,  P(A) 
is  the  long  run  proportion  of  times  that  A  is  true  in  repetitions.  For  example, 
if  we  say  that  the  probability  of  heads  is  1/2,  we  mean  that  if  we  flip  the 
coin  many  times  then  the  proportion  of  times  we  get  heads  tends  to  1/2  as 
the  number  of  tosses  increases.  An  infinitely  long,  unpredictable  sequence  of 
tosses  whose  limiting  proportion  tends  to  a  constant  is  an  idealization,  much 
like  the  idea  of  a  straight  line  in  geometry.  The  degree-of-belief  interpretation 
is  that  P(A)  measures  an  observer’s  strength  of  belief  that  A  is  true.  In  either 
interpretation, we require that Axioms 1 to 3 hold. The difference in interpretation
will not matter much until we deal with statistical inference. There,
the  differing  interpretations  lead  to  two  schools  of  inference:  the  frequentist 
and  the  Bayesian  schools.  We  defer  discussion  until  Chapter  11. 
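
The frequency interpretation can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: it uses Python's pseudo-random generator as a stand-in for a fair coin, and the function name `proportion_of_heads` and the fixed seed are hypothetical choices made here for reproducibility.

```python
import random

def proportion_of_heads(n_tosses, seed=0):
    """Simulate n_tosses fair coin flips and return the fraction that are heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion should settle near 1/2 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, proportion_of_heads(n))
```

A finite simulation can only suggest the limiting behavior; the infinite sequence of tosses remains an idealization.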

One  can  derive  many  properties  of  P  from  the  axioms,  such  as: 

    $$P(\emptyset) = 0$$
    $$A \subset B \implies P(A) \leq P(B)$$
    $$0 \leq P(A) \leq 1$$
    $$P(A^c) = 1 - P(A)$$
    $$A \cap B = \emptyset \implies P(A \cup B) = P(A) + P(B).$$   (1.1)

A  less  obvious  property  is  given  in  the  following  Lemma. 

1.6 Lemma. For any events $A$ and $B$,

    $$P(A \cup B) = P(A) + P(B) - P(AB).$$

Proof. Write $A \cup B = (AB^c) \cup (AB) \cup (A^cB)$ and note that these events
are disjoint. Hence, making repeated use of the fact that $P$ is additive for
disjoint events, we see that

    $$\begin{aligned}
    P(A \cup B) &= P\big((AB^c) \cup (AB) \cup (A^cB)\big) \\
    &= P(AB^c) + P(AB) + P(A^cB) \\
    &= P(AB^c) + P(AB) + P(A^cB) + P(AB) - P(AB) \\
    &= P\big((AB^c) \cup (AB)\big) + P\big((A^cB) \cup (AB)\big) - P(AB) \\
    &= P(A) + P(B) - P(AB).
    \end{aligned}$$ ■

1.7 Example. Two coin tosses. Let $H_1$ be the event that heads occurs on
toss 1 and let $H_2$ be the event that heads occurs on toss 2. If all outcomes are
equally likely, then $P(H_1 \cup H_2) = P(H_1) + P(H_2) - P(H_1 H_2) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = 3/4$. ■
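
The two-toss calculation can be checked by enumerating the sample space. This is a small sketch, assuming equally likely outcomes; the helper `prob` is a name introduced here for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

omega = set(product("HT", repeat=2))  # two tosses, equally likely outcomes

def prob(event):
    """Uniform probability on the finite sample space omega."""
    return Fraction(len(event & omega), len(omega))

H1 = {w for w in omega if w[0] == "H"}  # heads on toss 1
H2 = {w for w in omega if w[1] == "H"}  # heads on toss 2

lhs = prob(H1 | H2)                        # P(H1 union H2) by direct counting
rhs = prob(H1) + prob(H2) - prob(H1 & H2)  # right-hand side of Lemma 1.6
print(lhs, rhs)  # 3/4 3/4
```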


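1.8 Theorem (Continuity of Probabilities). If $A_n \to A$ then

    $$P(A_n) \to P(A)$$

as $n \to \infty$.

Proof. Suppose that $A_n$ is monotone increasing so that $A_1 \subset A_2 \subset \cdots$.
Let $A = \lim_{n \to \infty} A_n = \bigcup_{i=1}^{\infty} A_i$. Define $B_1 = A_1$,
$B_2 = \{\omega \in \Omega : \omega \in A_2, \omega \notin A_1\}$,
$B_3 = \{\omega \in \Omega : \omega \in A_3, \omega \notin A_2, \omega \notin A_1\}, \ldots$ It can be
shown that $B_1, B_2, \ldots$ are disjoint, $A_n = \bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{n} B_i$ for each $n$ and
$\bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} A_i$. (See exercise 1.) From Axiom 3,

    $$P(A_n) = P\left(\bigcup_{i=1}^{n} B_i\right) = \sum_{i=1}^{n} P(B_i)$$

and hence, using Axiom 3 again,

    $$\lim_{n \to \infty} P(A_n) = \lim_{n \to \infty} \sum_{i=1}^{n} P(B_i) = \sum_{i=1}^{\infty} P(B_i) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = P(A).$$ ■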


1.4  Probability  on  Finite  Sample  Spaces 

Suppose that the sample space $\Omega = \{\omega_1, \ldots, \omega_n\}$ is finite. For example, if we
toss a die twice, then $\Omega$ has 36 elements: $\Omega = \{(i, j) : i, j \in \{1, \ldots, 6\}\}$. If each
outcome is equally likely, then $P(A) = |A|/36$ where $|A|$ denotes the number
of elements in $A$. The probability that the sum of the dice is 11 is 2/36 since
there are two outcomes that correspond to this event.
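
The die example can be checked by brute-force enumeration. The sketch below assumes equally likely outcomes and counts the points in the event; the helper `prob`, which takes a predicate describing the event, is an illustrative name introduced here. Note that `Fraction` reduces 2/36 to 1/18.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (i, j) of two die tosses.
omega = list(product(range(1, 7), repeat=2))

def prob(predicate):
    """P(A) = |A| / |Omega| for the event A = {w : predicate(w)}."""
    favorable = [w for w in omega if predicate(w)]
    return Fraction(len(favorable), len(omega))

p_sum_11 = prob(lambda w: w[0] + w[1] == 11)
print(len(omega), p_sum_11)  # 36 1/18
```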

If $\Omega$ is finite and if each outcome is equally likely, then

    $$P(A) = \frac{|A|}{|\Omega|}$$

which is called the uniform probability distribution. To compute probabilities,
we need to count the number of points in an event $A$. Methods for
counting  points  are  called  combinatorial  methods.  We  needn’t  delve  into  these 
in  any  great  detail.  We  will,  however,  need  a  few  facts  from  counting  theory 
that will be useful later. Given $n$ objects, the number of ways of ordering
these objects is $n! = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1$. For convenience, we define
$0! = 1$. We also define

    $$\binom{n}{k} = \frac{n!}{k!(n-k)!},$$   (1.2)

read “n choose k”, which is the number of distinct ways of choosing $k$ objects
from $n$. For example, if we have a class of 20 people and we want to select a
committee of 3 students, then there are

    $$\binom{20}{3} = \frac{20!}{3!\,17!} = \frac{20 \times 19 \times 18}{3 \times 2 \times 1} = 1140$$

possible committees. We note the following properties:

    $$\binom{n}{0} = \binom{n}{n} = 1 \quad \text{and} \quad \binom{n}{k} = \binom{n}{n-k}.$$
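
As a quick check of the committee count and of the two properties of “n choose k” (that choosing none or all of the objects can be done in exactly one way, and that choosing $k$ is the same as leaving out $n - k$), here is a short sketch using the standard library. The variable names are illustrative.

```python
from math import comb, factorial

n, k = 20, 3

# Number of 3-person committees from a class of 20, computed two equivalent ways.
via_comb = comb(n, k)
via_factorials = factorial(n) // (factorial(k) * factorial(n - k))
print(via_comb, via_factorials)  # 1140 1140

# The boundary and symmetry properties of the binomial coefficient.
print(comb(n, 0) == comb(n, n) == 1)   # True
print(comb(n, k) == comb(n, n - k))    # True
```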


1.5  Independent  Events 


If we flip a fair coin twice, then the probability of two heads is $\frac{1}{2} \times \frac{1}{2}$. We
multiply  the  probabilities  because  we  regard  the  two  tosses  as  independent. 
The  formal  definition  of  independence  is  as  follows: 


1.9 Definition. Two events $A$ and $B$ are independent if

    $$P(AB) = P(A)P(B)$$   (1.3)

and we write $A \perp\!\!\!\perp B$. A set of events $\{A_i : i \in I\}$ is independent if

    $$P\left(\bigcap_{i \in J} A_i\right) = \prod_{i \in J} P(A_i)$$

for every finite subset $J$ of $I$. If $A$ and $B$ are not independent, we write

    $$A \not\perp\!\!\!\perp B.$$


Independence can arise in two distinct ways. Sometimes, we explicitly assume
that two events are independent. For example, in tossing a coin twice,
we usually assume the tosses are independent which reflects the fact that the
coin has no memory of the first toss. In other instances, we derive independence
by verifying that $P(AB) = P(A)P(B)$ holds. For example, in tossing
a fair die, let $A = \{2, 4, 6\}$ and let $B = \{1, 2, 3, 4\}$. Then, $A \cap B = \{2, 4\}$,
$P(AB) = 2/6 = P(A)P(B) = (1/2) \times (2/3)$ and so $A$ and $B$ are independent.
In this case, we didn’t assume that $A$ and $B$ are independent; it just turned
out that they were.
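
Deriving independence by checking the product rule is easy to mechanize on a finite sample space. The following sketch assumes a fair die with the uniform distribution; `prob` and `independent` are hypothetical helper names, and the disjoint pair in the last line is an extra illustration not taken from the text.

```python
from fractions import Fraction

omega = set(range(1, 7))  # one toss of a fair die

def prob(event):
    """Uniform probability of an event on omega."""
    return Fraction(len(event & omega), len(omega))

def independent(a, b):
    """True when P(AB) = P(A) P(B) under the uniform distribution on omega."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}
B = {1, 2, 3, 4}
print(prob(A & B), prob(A) * prob(B))  # 1/3 1/3
print(independent(A, B))               # True
print(independent({1, 2}, {3, 4}))     # False: disjoint, both with positive probability
```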

Suppose  that  A  and  B  are  disjoint  events,  each  with  positive  probability. 
Can they be independent? No. This follows since $P(A)P(B) > 0$ yet $P(AB) =
P(\emptyset) = 0$. Except in this special case, there is no way to judge independence
by  looking  at  the  sets  in  a  Venn  diagram. 

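1.10 Example. Toss a fair coin 10 times. Let $A$ = “at least one head.” Let $T_j$
be the event that tails occurs on the $j$th toss. Then

    $$\begin{aligned}
    P(A) &= 1 - P(A^c) \\
    &= 1 - P(\text{all tails}) \\
    &= 1 - P(T_1 T_2 \cdots T_{10}) \\
    &= 1 - P(T_1)P(T_2) \cdots P(T_{10}) \quad \text{using independence} \\
    &= 1 - \left(\frac{1}{2}\right)^{10} \approx .999.
    \end{aligned}$$ ■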


1.11 Example. Two people take turns trying to sink a basketball into a net.
Person 1 succeeds with probability 1/3 while person 2 succeeds with probability
1/4. What is the probability that person 1 succeeds before person 2?
Let $E$ denote the event of interest. Let $A_j$ be the event that the first success
is by person 1 and that it occurs on trial number $j$. Note that $A_1, A_2, \ldots$ are
disjoint and that $E = \bigcup_{j=1}^{\infty} A_j$. Hence,

    $$P(E) = \sum_{j=1}^{\infty} P(A_j).$$

Now, $P(A_1) = 1/3$. $A_2$ occurs if we have the sequence person 1 misses, person
2 misses, person 1 succeeds. This has probability $P(A_2) = (2/3)(3/4)(1/3) =
(1/2)(1/3)$. Following this logic we see that $P(A_j) = (1/2)^{j-1}(1/3)$. Hence,

    $$P(E) = \sum_{j=1}^{\infty} \frac{1}{3}\left(\frac{1}{2}\right)^{j-1} = \frac{1}{3}\sum_{j=1}^{\infty}\left(\frac{1}{2}\right)^{j-1} = \frac{2}{3}.$$

Here we used the fact that, if $0 < r < 1$ then $\sum_{j=k}^{\infty} r^j = r^k/(1-r)$. ■
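
The answer 2/3 can be sanity-checked in two ways: by summing the first few terms of the series, and by simulating the game. This is a sketch under stated assumptions: person 1 shoots first, shots are independent, and a pseudo-random generator stands in for the shots. The function name `person1_wins`, the seed, and the number of simulated games are illustrative choices.

```python
import random
from fractions import Fraction

# Partial sum of P(A_j) = (1/2)**(j-1) * (1/3) for j = 1..30; it approaches 2/3.
partial = sum(Fraction(1, 2) ** (j - 1) * Fraction(1, 3) for j in range(1, 31))
print(float(partial))

def person1_wins(rng, p1=1/3, p2=1/4):
    """Play alternating shots, person 1 first, until someone scores."""
    while True:
        if rng.random() < p1:
            return True
        if rng.random() < p2:
            return False

rng = random.Random(0)
n_games = 100_000
estimate = sum(person1_wins(rng) for _ in range(n_games)) / n_games
print(estimate)  # should be close to 2/3
```

The partial sum is exact arithmetic; the simulation is only approximate and will vary slightly with the seed.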




Summary  of  Independence 

1. $A$ and $B$ are independent if and only if $P(AB) = P(A)P(B)$.

2.  Independence  is  sometimes  assumed  and  sometimes  derived. 

3.  Disjoint  events  with  positive  probability  are  not  independent. 


1.6  Conditional  Probability 

Assuming that $P(B) > 0$, we define the conditional probability of $A$ given
that  B  has  occurred  as  follows: 


1.12 Definition. If $P(B) > 0$ then the conditional probability of $A$
given $B$ is

    $$P(A|B) = \frac{P(AB)}{P(B)}.$$   (1.4)


Think of $P(A|B)$ as the fraction of times $A$ occurs among those in which
$B$ occurs. For any fixed $B$ such that $P(B) > 0$, $P(\cdot|B)$ is a probability (i.e., it
satisfies the three axioms of probability). In particular, $P(A|B) \geq 0$, $P(\Omega|B) = 1$
and if $A_1, A_2, \ldots$ are disjoint then $P(\bigcup_{i=1}^{\infty} A_i | B) = \sum_{i=1}^{\infty} P(A_i|B)$. But it
is in general not true that $P(A|B \cup C) = P(A|B) + P(A|C)$. The rules of
probability apply to events on the left of the bar. In general it is not the case
that $P(A|B) = P(B|A)$. People get this confused all the time. For example,
the probability of spots given you have measles is 1 but the probability that
you have measles given that you have spots is not 1. In this case, the difference
between $P(A|B)$ and $P(B|A)$ is obvious but there are cases where it is less
obvious. This mistake is made often enough in legal cases that it is sometimes
called the prosecutor’s fallacy.

1.13 Example. A medical test for a disease $D$ has outcomes $+$ and $-$. The
probabilities are:

           $D$     $D^c$
    $+$    .009    .099
    $-$    .001    .891

From the definition of conditional probability,

    $$P(+|D) = \frac{P(+ \cap D)}{P(D)} = \frac{.009}{.009 + .001} = .9$$

and

    $$P(-|D^c) = \frac{P(- \cap D^c)}{P(D^c)} = \frac{.891}{.891 + .099} \approx .9.$$

Apparently, the test is fairly accurate. Sick people yield a positive 90 percent
of the time and healthy people yield a negative about 90 percent of the time.
Suppose you go for a test and get a positive. What is the probability you have
the disease? Most people answer .90. The correct answer is

    $$P(D|+) = \frac{P(+ \cap D)}{P(+)} = \frac{.009}{.009 + .099} \approx .08.$$

The lesson here is that you need to compute the answer numerically. Don’t
trust your intuition. ■
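
The medical test calculation is a matter of dividing a joint probability by the right marginal. The sketch below stores the joint table in a dictionary and computes the three conditional probabilities; the dictionary layout and the helper `marginal` are illustrative choices made here.

```python
# Joint probabilities from the table: keys are (test result, disease status).
joint = {
    ("+", "D"): 0.009, ("+", "Dc"): 0.099,
    ("-", "D"): 0.001, ("-", "Dc"): 0.891,
}

def marginal(index, value):
    """Sum the joint probabilities whose key has `value` at position `index`."""
    return sum(p for key, p in joint.items() if key[index] == value)

p_pos_given_D = joint[("+", "D")] / marginal(1, "D")      # P(+ | D)
p_neg_given_Dc = joint[("-", "Dc")] / marginal(1, "Dc")   # P(- | D^c)
p_D_given_pos = joint[("+", "D")] / marginal(0, "+")      # P(D | +)

print(round(p_pos_given_D, 3), round(p_neg_given_Dc, 3), round(p_D_given_pos, 3))
# 0.9 0.9 0.083
```

The posterior is small because the disease is rare: false positives from the large healthy group outnumber true positives from the small sick group.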


The results in the next lemma follow directly from the definition of conditional
probability.

1.14 Lemma. If $A$ and $B$ are independent events then $P(A|B) = P(A)$. Also,
for any pair of events $A$ and $B$,

    $$P(AB) = P(A|B)P(B) = P(B|A)P(A).$$

From the last lemma, we see that another interpretation of independence is
that knowing $B$ doesn’t change the probability of $A$. The formula $P(AB) =
P(A)P(B|A)$ is sometimes helpful for calculating probabilities.

1.15 Example. Draw two cards from a deck, without replacement. Let $A$ be
the event that the first draw is the Ace of Clubs and let $B$ be the event that
the second draw is the Queen of Diamonds. Then $P(AB) = P(A)P(B|A) =
(1/52) \times (1/51)$. ■
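
The card calculation can be confirmed by counting ordered draws. This sketch assumes a standard 52-card deck in which every ordered pair of distinct cards is equally likely; the card encoding as (rank, suit) tuples is an illustrative choice.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(r, s) for r in ranks for s in suits]

# Multiplication rule: P(AB) = P(A) P(B | A).
p_rule = Fraction(1, 52) * Fraction(1, 51)

# Brute-force check: count ordered pairs of distinct cards matching the event.
draws = list(permutations(deck, 2))
hits = [d for d in draws if d == (("A", "clubs"), ("Q", "diamonds"))]
p_count = Fraction(len(hits), len(draws))

print(p_rule, p_count)  # 1/2652 1/2652
```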


Summary of Conditional Probability

1. If $P(B) > 0$, then

    $$P(A|B) = \frac{P(AB)}{P(B)}.$$

2. $P(\cdot|B)$ satisfies the axioms of probability, for fixed $B$. In general,
$P(A|\cdot)$ does not satisfy the axioms of probability, for fixed $A$.

3. In general, $P(A|B) \neq P(B|A)$.

4. $A$ and $B$ are independent if and only if $P(A|B) = P(A)$.


1.7  Bayes’  Theorem 

Bayes’  theorem  is  the  basis  of  “expert  systems”  and  “Bayes’  nets,”  which  are 
discussed  in  Chapter  17.  First,  we  need  a  preliminary  result. 


1.16 Theorem (The Law of Total Probability). Let $A_1, \ldots, A_k$ be a partition
of $\Omega$. Then, for any event $B$,

    $$P(B) = \sum_{i=1}^{k} P(B|A_i)P(A_i).$$

Proof. Define $C_j = BA_j$ and note that $C_1, \ldots, C_k$ are disjoint and that
$B = \bigcup_{j=1}^{k} C_j$. Hence,

    $$P(B) = \sum_j P(C_j) = \sum_j P(BA_j) = \sum_j P(B|A_j)P(A_j)$$

since $P(BA_j) = P(B|A_j)P(A_j)$ from the definition of conditional probability. ■


1.17 Theorem (Bayes’ Theorem). Let $A_1, \ldots, A_k$ be a partition of $\Omega$ such
that $P(A_i) > 0$ for each $i$. If $P(B) > 0$ then, for each $i = 1, \ldots, k$,

    $$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}.$$

1.18 Remark. We call $P(A_i)$ the prior probability of $A_i$ and $P(A_i|B)$ the
posterior probability of $A_i$.

Proof. We apply the definition of conditional probability twice, followed
by the law of total probability:

    $$P(A_i|B) = \frac{P(A_iB)}{P(B)} = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}.$$ ■

1.19 Example. I divide my email into three categories: $A_1$ = “spam,” $A_2$ =
“low priority” and $A_3$ = “high priority.” From previous experience I find that
$P(A_1) = .7$, $P(A_2) = .2$ and $P(A_3) = .1$. Of course, $.7 + .2 + .1 = 1$. Let $B$ be
the event that the email contains the word “free.” From previous experience,
$P(B|A_1) = .9$, $P(B|A_2) = .01$, $P(B|A_3) = .01$. (Note: $.9 + .01 + .01 \neq 1$.) I
receive an email with the word “free.” What is the probability that it is spam?
Bayes’ theorem yields,

    $$P(A_1|B) = \frac{.9 \times .7}{(.9 \times .7) + (.01 \times .2) + (.01 \times .1)} = .995.$$ ■
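
Bayes’ theorem over a finite partition is a short computation: multiply each prior by its likelihood and normalize by the total. The sketch below applies this to the email example; the function name `posterior` and the list-based representation of the partition are illustrative choices.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a partition: returns P(A_i | B) for each i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # P(B), by the law of total probability
    return [j / total for j in joint]

priors = [0.7, 0.2, 0.1]          # P(A_1), P(A_2), P(A_3)
likelihoods = [0.9, 0.01, 0.01]   # P(B | A_1), P(B | A_2), P(B | A_3)

post = posterior(priors, likelihoods)
print([round(p, 3) for p in post])  # [0.995, 0.003, 0.002]
```

The posterior probabilities sum to 1 even though the likelihoods do not, because the normalizing constant is $P(B)$.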


1.8  Bibliographic  Remarks 

The  material  in  this  chapter  is  standard.  Details  can  be  found  in  any  number 
of  books.  At  the  introductory  level,  there  is  DeGroot  and  Schervish  (2002); 
at  the  intermediate  level,  Grimmett  and  Stirzaker  (1982)  and  Karr  (1993);  at 
the  advanced  level  there  are  Billingsley  (1979)  and  Breiman  (1992).  I  adapted 
many examples and exercises from DeGroot and Schervish (2002) and Grimmett
and Stirzaker (1982).


1.9  Appendix 

Generally, it is not feasible to assign probabilities to all subsets of a sample
space $\Omega$. Instead, one restricts attention to a set of events called a $\sigma$-algebra
or a $\sigma$-field which is a class $\mathcal{A}$ that satisfies:

(i) $\emptyset \in \mathcal{A}$,

(ii) if $A_1, A_2, \ldots \in \mathcal{A}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$ and

(iii) $A \in \mathcal{A}$ implies that $A^c \in \mathcal{A}$.

The sets in $\mathcal{A}$ are said to be measurable. We call $(\Omega, \mathcal{A})$ a measurable
space. If $P$ is a probability measure defined on $\mathcal{A}$, then $(\Omega, \mathcal{A}, P)$ is called a
probability space. When $\Omega$ is the real line, we take $\mathcal{A}$ to be the smallest
$\sigma$-field that contains all the open subsets, which is called the Borel $\sigma$-field.
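
On a finite sample space the three conditions can be checked mechanically. The following sketch is only an illustration under the assumption that $\Omega$ is finite, in which case closure under pairwise unions already gives closure under countable unions; the function name `is_sigma_field` is hypothetical.

```python
from itertools import combinations

def is_sigma_field(omega, collection):
    """Check conditions (i)-(iii) for a collection of subsets of a finite omega.

    On a finite sample space, closure under pairwise unions implies closure
    under countable unions, so only pairs are checked.
    """
    sets = {frozenset(s) for s in collection}
    omega = frozenset(omega)
    if frozenset() not in sets:                       # (i) contains the empty set
        return False
    if any(omega - a not in sets for a in sets):      # (iii) closed under complements
        return False
    return all(a | b in sets for a, b in combinations(sets, 2))  # (ii) unions

omega = {1, 2, 3, 4}
print(is_sigma_field(omega, [set(), {1, 2}, {3, 4}, omega]))  # True
print(is_sigma_field(omega, [set(), {1}, omega]))             # False: {2, 3, 4} is missing
```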


1.10  Exercises 

1.  Fill  in  the  details  of  the  proof  of  Theorem  1.8.  Also,  prove  the  monotone 
decreasing  case. 

2. Prove the statements in equation (1.1).

3. Let $\Omega$ be a sample space and let $A_1, A_2, \ldots$ be events. Define
$B_n = \bigcup_{i=n}^{\infty} A_i$ and $C_n = \bigcap_{i=n}^{\infty} A_i$.

(a) Show that $B_1 \supset B_2 \supset \cdots$ and that $C_1 \subset C_2 \subset \cdots$.

(b) Show that $\omega \in \bigcap_{n=1}^{\infty} B_n$ if and only if $\omega$ belongs to an infinite
number of the events $A_1, A_2, \ldots$.

(c) Show that $\omega \in \bigcup_{n=1}^{\infty} C_n$ if and only if $\omega$ belongs to all the events
$A_1, A_2, \ldots$ except possibly a finite number of those events.

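4. Let $\{A_i : i \in I\}$ be a collection of events where $I$ is an arbitrary index
set. Show that

    $$\left(\bigcup_{i \in I} A_i\right)^c = \bigcap_{i \in I} A_i^c \quad \text{and} \quad \left(\bigcap_{i \in I} A_i\right)^c = \bigcup_{i \in I} A_i^c.$$

Hint: First prove this for $I = \{1, \ldots, n\}$.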

5.  Suppose  we  toss  a  fair  coin  until  we  get  exactly  two  heads.  Describe 
the  sample  space  S.  What  is  the  probability  that  exactly  k  tosses  are 
required? 

6. Let $\Omega = \{0, 1, \ldots\}$. Prove that there does not exist a uniform distribution
on $\Omega$ (i.e., if $P(A) = P(B)$ whenever $|A| = |B|$, then $P$ cannot
satisfy the axioms of probability).

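7. Let $A_1, A_2, \ldots$ be events. Show that

    $$P\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} P(A_n).$$

Hint: Define $B_n = A_n - \bigcup_{i=1}^{n-1} A_i$. Then show that the $B_n$ are disjoint
and that $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} B_n$.

8. Suppose that $P(A_i) = 1$ for each $i$. Prove that

    $$P\left(\bigcap_{i=1}^{\infty} A_i\right) = 1.$$

9. For fixed $B$ such that $P(B) > 0$, show that $P(\cdot|B)$ satisfies the axioms
of probability.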

10.  You  have  probably  heard  it  before.  Now  you  can  solve  it  rigorously. 
It is called the “Monty Hall Problem.” A prize is placed at random
behind one of three doors. You pick a door. To be concrete, let’s suppose
you always pick door 1. Now Monty Hall chooses one of the other two
doors, opens it and shows you that it is empty. He then gives you the
opportunity to keep your door or switch to the other unopened door.
Should you stay or switch? Intuition suggests it doesn’t matter. The
correct answer is that you should switch. Prove it. It will help to specify
the sample space and the relevant events carefully. Thus write
$\Omega = \{(\omega_1, \omega_2) : \omega_i \in \{1, 2, 3\}\}$ where $\omega_1$ is where the prize is and $\omega_2$ is the
door Monty opens.

11. Suppose that $A$ and $B$ are independent events. Show that $A^c$ and $B^c$ are independent events.

12.  There  are  three  cards.  The  first  is  green  on  both  sides,  the  second  is  red 
on  both  sides  and  the  third  is  green  on  one  side  and  red  on  the  other.  We 
choose  a  card  at  random  and  we  see  one  side  (also  chosen  at  random). 
If  the  side  we  see  is  green,  what  is  the  probability  that  the  other  side  is 
also  green?  Many  people  intuitively  answer  1/2.  Show  that  the  correct 
answer  is  2/3. 

13.  Suppose  that  a  fair  coin  is  tossed  repeatedly  until  both  a  head  and  tail 
have  appeared  at  least  once. 

(a) Describe the sample space $\Omega$.

(b)  What  is  the  probability  that  three  tosses  will  be  required? 

14. Show that if $\mathbb{P}(A) = 0$ or $\mathbb{P}(A) = 1$ then $A$ is independent of every other event. Show that if $A$ is independent of itself then $\mathbb{P}(A)$ is either 0 or 1.

15.  The  probability  that  a  child  has  blue  eyes  is  1/4.  Assume  independence 
between  children.  Consider  a  family  with  3  children. 

(a)  If  it  is  known  that  at  least  one  child  has  blue  eyes,  what  is  the 
probability  that  at  least  two  children  have  blue  eyes? 

(b)  If  it  is  known  that  the  youngest  child  has  blue  eyes,  what  is  the 
probability  that  at  least  two  children  have  blue  eyes? 

16.  Prove  Lemma  1.14. 

17. Show that

$$\mathbb{P}(ABC) = \mathbb{P}(A \mid BC)\,\mathbb{P}(B \mid C)\,\mathbb{P}(C).$$

18. Suppose $k$ events form a partition of the sample space $\Omega$, i.e., they are disjoint and $\bigcup_{i=1}^{k} A_i = \Omega$. Assume that $\mathbb{P}(B) > 0$. Prove that if $\mathbb{P}(A_1 \mid B) < \mathbb{P}(A_1)$ then $\mathbb{P}(A_i \mid B) > \mathbb{P}(A_i)$ for some $i = 2, \ldots, k$.

19.  Suppose  that  30  percent  of  computer  owners  use  a  Macintosh,  50  percent 
use  Windows,  and  20  percent  use  Linux.  Suppose  that  65  percent  of 
the  Mac  users  have  succumbed  to  a  computer  virus,  82  percent  of  the 
Windows  users  get  the  virus,  and  50  percent  of  the  Linux  users  get 
the  virus.  We  select  a  person  at  random  and  learn  that  her  system  was 
infected  with  the  virus.  What  is  the  probability  that  she  is  a  Windows 
user? 

20. A box contains 5 coins and each has a different probability of showing heads. Let $p_1, \ldots, p_5$ denote the probability of heads on each coin. Suppose that

$$p_1 = 0,\quad p_2 = 1/4,\quad p_3 = 1/2,\quad p_4 = 3/4 \quad \text{and} \quad p_5 = 1.$$

Let $H$ denote “heads is obtained” and let $C_i$ denote the event that coin $i$ is selected.

(a) Select a coin at random and toss it. Suppose a head is obtained. What is the posterior probability that coin $i$ was selected ($i = 1, \ldots, 5$)? In other words, find $\mathbb{P}(C_i \mid H)$ for $i = 1, \ldots, 5$.

(b) Toss the coin again. What is the probability of another head? In other words find $\mathbb{P}(H_2 \mid H_1)$ where $H_j$ = “heads on toss $j$.”

Now suppose that the experiment was carried out as follows: We select a coin at random and toss it until a head is obtained.

(c) Find $\mathbb{P}(C_i \mid B_4)$ where $B_4$ = “first head is obtained on toss 4.”

21.  (Computer  Experiment.)  Suppose  a  coin  has  probability  p  of  falling  heads 
up.  If  we  flip  the  coin  many  times,  we  would  expect  the  proportion  of 
heads  to  be  near  p.  We  will  make  this  formal  later.  Take  p  =  .3  and 
n  =  1,000  and  simulate  n  coin  flips.  Plot  the  proportion  of  heads  as  a 
function  of  n.  Repeat  for  p  =  .03. 

22. (Computer Experiment.) Suppose we flip a coin $n$ times and let $p$ denote the probability of heads. Let $X$ be the number of heads. We call $X$ a binomial random variable, which is discussed in the next chapter. Intuition suggests that $X$ will be close to $np$. To see if this is true, we can repeat this experiment many times and average the $X$ values. Carry out a simulation and compare the average of the $X$’s to $np$. Try this for $p = .3$ and $n = 10$, $n = 100$, and $n = 1{,}000$.

23.  (Computer  Experiment.)  Here  we  will  get  some  experience  simulating 
conditional  probabilities.  Consider  tossing  a  fair  die.  Let  A  =  {2, 4,  6} 
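and $B = \{1, 2, 3, 4\}$. Then, $\mathbb{P}(A) = 1/2$, $\mathbb{P}(B) = 2/3$ and $\mathbb{P}(AB) = 1/3$. Since $\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B)$, the events $A$ and $B$ are independent. Simulate draws from the sample space and verify that $\widehat{\mathbb{P}}(AB) = \widehat{\mathbb{P}}(A)\widehat{\mathbb{P}}(B)$ where $\widehat{\mathbb{P}}(A)$ is the proportion of times $A$ occurred in the simulation and similarly for $\widehat{\mathbb{P}}(AB)$ and $\widehat{\mathbb{P}}(B)$. Now find two events $A$ and $B$ that are not independent. Compute $\widehat{\mathbb{P}}(A)$, $\widehat{\mathbb{P}}(B)$ and $\widehat{\mathbb{P}}(AB)$. Compare the calculated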
values  to  their  theoretical  values.  Report  your  results  and  interpret. 


2 Random Variables


2.1  Introduction 

Statistics  and  data  mining  are  concerned  with  data.  How  do  we  link  sample 
spaces  and  events  to  data?  The  link  is  provided  by  the  concept  of  a  random 
variable. 


2.1 Definition. A random variable is a mapping$^1$

$$X : \Omega \to \mathbb{R}$$

that assigns a real number $X(\omega)$ to each outcome $\omega$.


At  a  certain  point  in  most  probability  courses,  the  sample  space  is  rarely 
mentioned  anymore  and  we  work  directly  with  random  variables.  But  you 
should  keep  in  mind  that  the  sample  space  is  really  there,  lurking  in  the 
background. 

2.2 Example. Flip a coin ten times. Let $X(\omega)$ be the number of heads in the sequence $\omega$. For example, if $\omega = HHTHHTHHTT$, then $X(\omega) = 6$. ■

$^1$ Technically, a random variable must be measurable. See the appendix for details.

2.3 Example. Let $\Omega = \{(x, y) : x^2 + y^2 \le 1\}$ be the unit disk. Consider drawing a point at random from $\Omega$. (We will make this idea more precise later.) A typical outcome is of the form $\omega = (x, y)$. Some examples of random variables are $X(\omega) = x$, $Y(\omega) = y$, $Z(\omega) = x + y$, and $W(\omega) = \sqrt{x^2 + y^2}$. ■

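Given a random variable $X$ and a subset $A$ of the real line, define $X^{-1}(A) = \{\omega \in \Omega : X(\omega) \in A\}$ and let

$$\mathbb{P}(X \in A) = \mathbb{P}(X^{-1}(A)) = \mathbb{P}(\{\omega \in \Omega : X(\omega) \in A\})$$

$$\mathbb{P}(X = x) = \mathbb{P}(X^{-1}(x)) = \mathbb{P}(\{\omega \in \Omega : X(\omega) = x\}).$$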

Notice  that  X  denotes  the  random  variable  and  x  denotes  a  particular  value 
of  X. 


2.4 Example. Flip a coin twice and let $X$ be the number of heads. Then, $\mathbb{P}(X = 0) = \mathbb{P}(\{TT\}) = 1/4$, $\mathbb{P}(X = 1) = \mathbb{P}(\{HT, TH\}) = 1/2$ and $\mathbb{P}(X = 2) = \mathbb{P}(\{HH\}) = 1/4$. The random variable and its distribution can be summarized as follows:

$\omega$    $\mathbb{P}(\{\omega\})$    $X(\omega)$
TT    1/4       0
TH    1/4       1
HT    1/4       1
HH    1/4       2

$x$    $\mathbb{P}(X = x)$
0    1/4
1    1/2
2    1/4


Try  generalizing  this  to  n  flips. 
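
One way to explore the generalization to $n$ flips is to enumerate the sample space directly. The sketch below is only an illustration, not part of the original text: it assumes a fair coin (as the probabilities in Example 2.4 imply), and the function name `heads_distribution` is a hypothetical helper.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(n):
    """Return {x: P(X = x)} for X = number of heads in n fair coin flips."""
    outcomes = list(product("HT", repeat=n))   # the sample space, 2**n outcomes
    prob_each = Fraction(1, len(outcomes))     # fair coin: outcomes equally likely
    # X(omega) is the number of heads in the outcome omega.
    counts = Counter(omega.count("H") for omega in outcomes)
    return {x: counts[x] * prob_each for x in sorted(counts)}


dist2 = heads_distribution(2)   # reproduces the table of Example 2.4
dist3 = heads_distribution(3)
print(dist2)
print(dist3)
```

Enumeration makes the role of the sample space explicit, but it grows as $2^n$, so it is only practical for small $n$.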


2.2  Distribution  Functions  and  Probability  Functions 

Given a random variable $X$, we define the cumulative distribution function (or distribution function) as follows.


2.5 Definition. The cumulative distribution function, or CDF, is the function $F_X : \mathbb{R} \to [0, 1]$ defined by

$$F_X(x) = \mathbb{P}(X \le x). \qquad (2.1)$$

FIGURE 2.1. CDF for flipping a coin twice (Example 2.6).


We will see later that the CDF effectively contains all the information about the random variable. Sometimes we write the CDF as $F$ instead of $F_X$.


2.6 Example. Flip a fair coin twice and let $X$ be the number of heads. Then $\mathbb{P}(X = 0) = \mathbb{P}(X = 2) = 1/4$ and $\mathbb{P}(X = 1) = 1/2$. The distribution function is

$$F_X(x) = \begin{cases} 0 & x < 0 \\ 1/4 & 0 \le x < 1 \\ 3/4 & 1 \le x < 2 \\ 1 & x \ge 2. \end{cases}$$

The CDF is shown in Figure 2.1. Although this example is simple, study it carefully. CDF’s can be very confusing. Notice that the function is right continuous, non-decreasing, and that it is defined for all $x$, even though the random variable only takes values 0, 1, and 2. Do you see why $F_X(1.4) = .75$? ■
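
The step-function behaviour of this CDF can be checked with a few lines of Python. This is a minimal sketch, assuming the fair-coin mass function of Example 2.6; the names `pmf` and `cdf` are illustrative only.

```python
from fractions import Fraction

# Probability mass function of X = number of heads in two fair flips.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def cdf(x):
    """F_X(x) = P(X <= x): add up the mass at every value not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)


# The CDF is defined for every real x, not only for the values X can take.
for x in (-1, 0, 0.5, 1, 1.4, 2, 10):
    print(x, cdf(x))
```

Because only the values 0 and 1 lie at or below 1.4, the sum at $x = 1.4$ is $1/4 + 1/2 = 3/4$.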


The following result shows that the CDF completely determines the distribution of a random variable.


2.7 Theorem. Let $X$ have CDF $F$ and let $Y$ have CDF $G$. If $F(x) = G(x)$ for all $x$, then $\mathbb{P}(X \in A) = \mathbb{P}(Y \in A)$ for all $A$.$^2$

2.8 Theorem. A function $F$ mapping the real line to $[0, 1]$ is a CDF for some probability $\mathbb{P}$ if and only if $F$ satisfies the following three conditions:

(i) $F$ is non-decreasing: $x_1 < x_2$ implies that $F(x_1) \le F(x_2)$.

(ii) $F$ is normalized:

$$\lim_{x \to -\infty} F(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F(x) = 1.$$

(iii) $F$ is right-continuous: $F(x) = F(x^+)$ for all $x$, where

$$F(x^+) = \lim_{y \to x,\ y > x} F(y).$$

$^2$ Technically, we only have that $\mathbb{P}(X \in A) = \mathbb{P}(Y \in A)$ for every measurable event $A$.

Proof. Suppose that $F$ is a CDF. Let us show that (iii) holds. Let $x$ be a real number and let $y_1, y_2, \ldots$ be a sequence of real numbers such that $y_1 > y_2 > \cdots$ and $\lim_i y_i = x$. Let $A_i = (-\infty, y_i]$ and let $A = (-\infty, x]$. Note that $A = \bigcap_{i=1}^{\infty} A_i$ and also note that $A_1 \supset A_2 \supset \cdots$. Because the events are monotone, $\lim_i \mathbb{P}(A_i) = \mathbb{P}(\bigcap_i A_i)$. Thus,

$$F(x) = \mathbb{P}(A) = \mathbb{P}\left(\bigcap_i A_i\right) = \lim_i \mathbb{P}(A_i) = \lim_i F(y_i) = F(x^+).$$

Showing (i) and (ii) is similar. Proving the other direction, namely that if $F$ satisfies (i), (ii), and (iii) then it is a CDF for some random variable, uses some deep tools in analysis. ■

2.9 Definition. $X$ is discrete if it takes countably$^3$ many values $\{x_1, x_2, \ldots\}$. We define the probability function or probability mass function for $X$ by $f_X(x) = \mathbb{P}(X = x)$.


Thus, $f_X(x) \ge 0$ for all $x \in \mathbb{R}$ and $\sum_i f_X(x_i) = 1$. Sometimes we write $f$ instead of $f_X$. The CDF of $X$ is related to $f_X$ by

$$F_X(x) = \mathbb{P}(X \le x) = \sum_{x_i \le x} f_X(x_i).$$

2.10 Example. The probability function for Example 2.6 is

$$f_X(x) = \begin{cases} 1/4 & x = 0 \\ 1/2 & x = 1 \\ 1/4 & x = 2 \\ 0 & \text{otherwise.} \end{cases}$$


See  Figure  2.2.  ■ 


$^3$ A set is countable if it is finite or it can be put in a one-to-one correspondence with the integers. The even numbers, the odd numbers, and the rationals are countable; the set of real numbers between 0 and 1 is not countable.

FIGURE 2.2. Probability function for flipping a coin twice (Example 2.6).


2.11 Definition. A random variable $X$ is continuous if there exists a function $f_X$ such that $f_X(x) \ge 0$ for all $x$, $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$ and for every $a \le b$,

$$\mathbb{P}(a < X < b) = \int_a^b f_X(x)\,dx. \qquad (2.2)$$

The function $f_X$ is called the probability density function (PDF). We have that

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

and $f_X(x) = F_X'(x)$ at all points $x$ at which $F_X$ is differentiable.


Sometimes we write $\int f(x)\,dx$ or $\int f$ to mean $\int_{-\infty}^{\infty} f(x)\,dx$.

2.12 Example. Suppose that $X$ has PDF

$$f_X(x) = \begin{cases} 1 & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Clearly, $f_X(x) \ge 0$ and $\int f_X(x)\,dx = 1$. A random variable with this density is said to have a Uniform(0,1) distribution. This is meant to capture the idea of choosing a point at random between 0 and 1. The CDF is given by

$$F_X(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1. \end{cases}$$

See Figure 2.3. ■

FIGURE 2.3. CDF for Uniform(0,1).


2.13 Example. Suppose that $X$ has PDF

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \dfrac{1}{(1 + x)^2} & \text{otherwise.} \end{cases}$$

Since $\int f(x)\,dx = 1$, this is a well-defined PDF. ■

Warning! Continuous random variables can lead to confusion. First, note that if $X$ is continuous then $\mathbb{P}(X = x) = 0$ for every $x$. Don’t try to think of $f(x)$ as $\mathbb{P}(X = x)$. This only holds for discrete random variables. We get probabilities from a PDF by integrating. A PDF can be bigger than 1 (unlike a mass function). For example, if $f(x) = 5$ for $x \in [0, 1/5]$ and 0 otherwise, then $f(x) \ge 0$ and $\int f(x)\,dx = 1$ so this is a well-defined PDF even though $f(x) = 5$ in some places. In fact, a PDF can be unbounded. For example, if $f(x) = (2/3)x^{-1/3}$ for $0 < x < 1$ and $f(x) = 0$ otherwise, then $\int f(x)\,dx = 1$ even though $f$ is not bounded.
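
Both claims in the warning can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a plain midpoint rule (the helper `midpoint_integral` and the names `flat` and `spike` are hypothetical), which never evaluates the integrand at $x = 0$, where the second density blows up.

```python
def midpoint_integral(f, lo, hi, n=200_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def flat(x):
    return 5.0                                # density equal to 5 on [0, 1/5]


def spike(x):
    return (2.0 / 3.0) * x ** (-1.0 / 3.0)    # unbounded as x approaches 0


area_flat = midpoint_integral(flat, 0.0, 0.2)
area_spike = midpoint_integral(spike, 0.0, 1.0)
print(area_flat, area_spike)   # both close to 1
print(spike(1e-9))             # the density itself is very large near 0
```

The areas are approximately 1 in both cases, even though the density values exceed 1, which is the point of the warning: probabilities come from areas, not from heights.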


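2.14 Example. Let

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \dfrac{1}{1 + x} & \text{otherwise.} \end{cases}$$

This is not a PDF since $\int f(x)\,dx = \int_0^{\infty} dx/(1 + x) = \int_1^{\infty} du/u = \log(\infty) = \infty$.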


2.15 Lemma. Let $F$ be the CDF for a random variable $X$. Then:

1. $\mathbb{P}(X = x) = F(x) - F(x^-)$ where $F(x^-) = \lim_{y \uparrow x} F(y)$;

2. $\mathbb{P}(x < X \le y) = F(y) - F(x)$;

3. $\mathbb{P}(X > x) = 1 - F(x)$;

4. If $X$ is continuous then

$$F(b) - F(a) = \mathbb{P}(a < X < b) = \mathbb{P}(a \le X < b) = \mathbb{P}(a < X \le b) = \mathbb{P}(a \le X \le b).$$

It  is  also  useful  to  define  the  inverse  CDF  (or  quantile  function). 


2.16 Definition. Let $X$ be a random variable with CDF $F$. The inverse CDF or quantile function is defined by$^4$

$$F^{-1}(q) = \inf\{x : F(x) > q\}$$

for $q \in [0, 1]$. If $F$ is strictly increasing and continuous then $F^{-1}(q)$ is the unique real number $x$ such that $F(x) = q$.


We call $F^{-1}(1/4)$ the first quartile, $F^{-1}(1/2)$ the median (or second quartile), and $F^{-1}(3/4)$ the third quartile.

Two random variables $X$ and $Y$ are equal in distribution, written $X \overset{d}{=} Y$, if $F_X(x) = F_Y(x)$ for all $x$. This does not mean that $X$ and $Y$ are equal. Rather, it means that all probability statements about $X$ and $Y$ will be the same. For example, suppose that $\mathbb{P}(X = 1) = \mathbb{P}(X = -1) = 1/2$. Let $Y = -X$. Then $\mathbb{P}(Y = 1) = \mathbb{P}(Y = -1) = 1/2$ and so $X \overset{d}{=} Y$. But $X$ and $Y$ are not equal. In fact, $\mathbb{P}(X = Y) = 0$.


2.3  Some  Important  Discrete  Random  Variables 

Warning About Notation! It is traditional to write $X \sim F$ to indicate that $X$ has distribution $F$. This is unfortunate notation since the symbol $\sim$ is also used to denote an approximation. The notation $X \sim F$ is so pervasive that we are stuck with it. Read $X \sim F$ as “$X$ has distribution $F$” not as “$X$ is approximately $F$”.

$^4$ If you are unfamiliar with “inf”, just think of it as the minimum.

The Point Mass Distribution. $X$ has a point mass distribution at $a$, written $X \sim \delta_a$, if $\mathbb{P}(X = a) = 1$ in which case

$$F(x) = \begin{cases} 0 & x < a \\ 1 & x \ge a. \end{cases}$$

The probability mass function is $f(x) = 1$ for $x = a$ and 0 otherwise.


The Discrete Uniform Distribution. Let $k > 1$ be a given integer. Suppose that $X$ has probability mass function given by

$$f(x) = \begin{cases} 1/k & \text{for } x = 1, \ldots, k \\ 0 & \text{otherwise.} \end{cases}$$

We say that $X$ has a uniform distribution on $\{1, \ldots, k\}$.


The Bernoulli Distribution. Let $X$ represent a binary coin flip. Then $\mathbb{P}(X = 1) = p$ and $\mathbb{P}(X = 0) = 1 - p$ for some $p \in [0, 1]$. We say that $X$ has a Bernoulli distribution written $X \sim \text{Bernoulli}(p)$. The probability function is $f(x) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$.


The Binomial Distribution. Suppose we have a coin which falls heads up with probability $p$ for some $0 \le p \le 1$. Flip the coin $n$ times and let $X$ be the number of heads. Assume that the tosses are independent. Let $f(x) = \mathbb{P}(X = x)$ be the mass function. It can be shown that

$$f(x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n - x} & \text{for } x = 0, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$

A random variable with this mass function is called a Binomial random variable and we write $X \sim \text{Binomial}(n, p)$. If $X_1 \sim \text{Binomial}(n_1, p)$ and $X_2 \sim \text{Binomial}(n_2, p)$ are independent then $X_1 + X_2 \sim \text{Binomial}(n_1 + n_2, p)$.
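
The mass function and the additivity property can be checked numerically. The following is a sketch rather than a proof: the values $n_1 = 4$, $n_2 = 6$, $p = 0.3$ are arbitrary choices, the helper names are hypothetical, and the sum $X_1 + X_2$ is computed under the assumption that $X_1$ and $X_2$ are independent.

```python
from math import comb, isclose


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p); zero outside 0..n."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n1, n2, p = 4, 6, 0.3

# The mass function sums to one.
total = sum(binomial_pmf(x, n1, p) for x in range(n1 + 1))


def pmf_of_sum(s):
    """P(X1 + X2 = s) for independent X1, X2, summing over all splits of s."""
    return sum(binomial_pmf(j, n1, p) * binomial_pmf(s - j, n2, p)
               for j in range(n1 + 1))


# Compare with the Binomial(n1 + n2, p) mass function at every value.
matches = all(isclose(pmf_of_sum(s), binomial_pmf(s, n1 + n2, p))
              for s in range(n1 + n2 + 1))
print(total, matches)
```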


Warning! Let us take this opportunity to prevent some confusion. $X$ is a random variable; $x$ denotes a particular value of the random variable; $n$ and $p$ are parameters, that is, fixed real numbers. The parameter $p$ is usually unknown and must be estimated from data; that’s what statistical inference is all about. In most statistical models, there are random variables and parameters: don’t confuse them.


The Geometric Distribution. $X$ has a geometric distribution with parameter $p \in (0, 1)$, written $X \sim \text{Geom}(p)$, if

$$\mathbb{P}(X = k) = p(1 - p)^{k - 1}, \qquad k \ge 1.$$

We have that

$$\sum_{k=1}^{\infty} \mathbb{P}(X = k) = p \sum_{k=1}^{\infty} (1 - p)^{k - 1} = \frac{p}{1 - (1 - p)} = 1.$$

Think of $X$ as the number of flips needed until the first head when flipping a coin.


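The Poisson Distribution. $X$ has a Poisson distribution with parameter $\lambda$, written $X \sim \text{Poisson}(\lambda)$ if

$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \qquad x \ge 0.$$

Note that

$$\sum_{x=0}^{\infty} f(x) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1.$$

The Poisson is often used as a model for counts of rare events like radioactive decay and traffic accidents. If $X_1 \sim \text{Poisson}(\lambda_1)$ and $X_2 \sim \text{Poisson}(\lambda_2)$ are independent then $X_1 + X_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$.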


Warning! We defined random variables to be mappings from a sample space $\Omega$ to $\mathbb{R}$ but we did not mention the sample space in any of the distributions above. As I mentioned earlier, the sample space often “disappears” but it is really there in the background. Let’s construct a sample space explicitly for a Bernoulli random variable. Let $\Omega = [0, 1]$ and define $\mathbb{P}$ to satisfy $\mathbb{P}([a, b]) = b - a$ for $0 \le a \le b \le 1$. Fix $p \in [0, 1]$ and define

$$X(\omega) = \begin{cases} 1 & \omega \le p \\ 0 & \omega > p. \end{cases}$$

Then $\mathbb{P}(X = 1) = \mathbb{P}(\omega \le p) = \mathbb{P}([0, p]) = p$ and $\mathbb{P}(X = 0) = 1 - p$. Thus, $X \sim \text{Bernoulli}(p)$. We could do this for all the distributions defined above. In practice, we think of a random variable like a random number but formally it is a mapping defined on some sample space.
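
The construction above translates almost literally into code. In this sketch, which is an illustration and not part of the original text, the outcome $\omega$ is drawn uniformly from $[0, 1)$ with Python's `random` module, and `X` is the mapping from outcomes to $\{0, 1\}$; the function `estimate_prob_one`, the seed, and the number of draws are arbitrary assumptions.

```python
import random


def X(omega, p):
    """The random variable: maps an outcome omega in [0, 1] to 0 or 1."""
    return 1 if omega <= p else 0


def estimate_prob_one(p, draws=100_000, seed=0):
    """Fraction of uniformly drawn outcomes omega for which X(omega) = 1."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    return sum(X(rng.random(), p) for _ in range(draws)) / draws


estimate = estimate_prob_one(0.3)
print(estimate)   # should be near p = 0.3
```

The randomness lives entirely in the draw of $\omega$; the mapping `X` itself is an ordinary deterministic function, which is the formal point of the warning.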


2.4  Some  Important  Continuous  Random  Variables 


The Uniform Distribution. $X$ has a Uniform($a$, $b$) distribution, written $X \sim \text{Uniform}(a, b)$, if

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$$

where $a < b$. The distribution function is

$$F(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & x \in [a, b] \\ 1 & x > b. \end{cases}$$

Normal (Gaussian). $X$ has a Normal (or Gaussian) distribution with parameters $\mu$ and $\sigma$, denoted by $X \sim N(\mu, \sigma^2)$, if

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}, \qquad x \in \mathbb{R}$$

where $\mu \in \mathbb{R}$ and $\sigma > 0$. The parameter $\mu$ is the “center” (or mean) of the distribution and $\sigma$ is the “spread” (or standard deviation) of the distribution. (The mean and standard deviation will be formally defined in the next chapter.) The Normal plays an important role in probability and statistics. Many phenomena in nature have approximately Normal distributions. Later, we shall study the Central Limit Theorem which says that the distribution of a sum of random variables can be approximated by a Normal distribution.

We say that $X$ has a standard Normal distribution if $\mu = 0$ and $\sigma = 1$. Tradition dictates that a standard Normal random variable is denoted by $Z$. The PDF and CDF of a standard Normal are denoted by $\phi(z)$ and $\Phi(z)$. The PDF is plotted in Figure 2.4. There is no closed-form expression for $\Phi$. Here are some useful facts:

(i) If $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$.

(ii) If $Z \sim N(0, 1)$, then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$.

(iii) If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, n$ are independent, then

$$\sum_{i=1}^{n} X_i \sim N\left( \sum_{i=1}^{n} \mu_i,\ \sum_{i=1}^{n} \sigma_i^2 \right).$$

It follows from (i) that if $X \sim N(\mu, \sigma^2)$, then

$$\mathbb{P}(a < X < b) = \mathbb{P}\left( \frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma} \right) = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right). \qquad (2.3)$$

Thus we can compute any probabilities we want as long as we can compute the CDF $\Phi(z)$ of a standard Normal. All statistical computing packages will compute $\Phi(z)$ and $\Phi^{-1}(q)$. Most statistics texts, including this one, have a table of values of $\Phi(z)$.

FIGURE 2.4. Density of a standard Normal.


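2.17 Example. Suppose that $X \sim N(3, 5)$. Find $\mathbb{P}(X > 1)$. The solution is

$$\mathbb{P}(X > 1) = 1 - \mathbb{P}(X < 1) = 1 - \mathbb{P}\left( Z < \frac{1 - 3}{\sqrt{5}} \right) = 1 - \Phi(-0.8944) = 0.81.$$

Now find $q = F^{-1}(0.2)$, where $F$ is the CDF of $X$. This means we have to find $q$ such that $\mathbb{P}(X < q) = 0.2$. We solve this by writing

$$0.2 = \mathbb{P}(X < q) = \mathbb{P}\left( Z < \frac{q - \mu}{\sigma} \right) = \Phi\left( \frac{q - \mu}{\sigma} \right).$$

From the Normal table, $\Phi(-0.8416) = 0.2$. Therefore,

$$-0.8416 = \frac{q - \mu}{\sigma} = \frac{q - 3}{\sqrt{5}}$$

and hence $q = 3 - 0.8416\sqrt{5} = 1.1181$.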
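
In place of a Normal table, the two quantities in Example 2.17 can be computed with the `NormalDist` class from Python's standard `statistics` module. This is a sketch for checking the arithmetic; note the assumption, stated in the comments, that the second argument of $N(3, 5)$ is the variance, so the standard deviation passed to `NormalDist` is $\sqrt{5}$.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(mu=3, sigma=sqrt(5))   # N(3, 5): 5 is the variance, sigma = sqrt(5)
Z = NormalDist()                      # standard Normal, N(0, 1)

prob = 1 - X.cdf(1)                   # P(X > 1)
q = X.inv_cdf(0.2)                    # the q with P(X < q) = 0.2

# The same quantities via standardization, as in the worked solution.
prob_std = 1 - Z.cdf((1 - 3) / sqrt(5))
q_std = 3 + sqrt(5) * Z.inv_cdf(0.2)

print(round(prob, 4), round(q, 4))
```

Both routes agree because standardizing $X$ and then using $\Phi$ is exactly what the direct call does internally with the location and scale of the distribution.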


Exponential Distribution. $X$ has an Exponential distribution with parameter $\beta$, denoted by $X \sim \text{Exp}(\beta)$, if

$$f(x) = \frac{1}{\beta} e^{-x/\beta}, \qquad x > 0$$

where $\beta > 0$. The exponential distribution is used to model the lifetimes of electronic components and the waiting times between rare events.

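Gamma Distribution. For $\alpha > 0$, the Gamma function is defined by $\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy$. $X$ has a Gamma distribution with parameters $\alpha$ and $\beta$, denoted by $X \sim \text{Gamma}(\alpha, \beta)$, if

$$f(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}, \qquad x > 0$$

where $\alpha, \beta > 0$. The exponential distribution is just a Gamma(1, $\beta$) distribution. If $X_i \sim \text{Gamma}(\alpha_i, \beta)$ are independent, then $\sum_{i=1}^{n} X_i \sim \text{Gamma}\left(\sum_{i=1}^{n} \alpha_i, \beta\right)$.


The Beta Distribution. $X$ has a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, denoted by $X \sim \text{Beta}(\alpha, \beta)$, if

$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad 0 < x < 1.$$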


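t and Cauchy Distribution. $X$ has a $t$ distribution with $\nu$ degrees of freedom, written $X \sim t_{\nu}$, if

$$f(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)} \cdot \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu + 1)/2}}.$$

The $t$ distribution is similar to a Normal but it has thicker tails. In fact, the Normal corresponds to a $t$ with $\nu = \infty$. The Cauchy distribution is a special case of the $t$ distribution corresponding to $\nu = 1$. The density is

$$f(x) = \frac{1}{\pi(1 + x^2)}.$$

To see that this is indeed a density:

$$\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{dx}{1 + x^2} = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{d \tan^{-1}(x)}{dx}\,dx = \frac{1}{\pi} \left[ \tan^{-1}(\infty) - \tan^{-1}(-\infty) \right] = \frac{1}{\pi} \left[ \frac{\pi}{2} - \left( -\frac{\pi}{2} \right) \right] = 1.$$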


The $\chi^2$ distribution. $X$ has a $\chi^2$ distribution with $p$ degrees of freedom, written $X \sim \chi^2_p$, if

$$f(x) = \frac{1}{\Gamma(p/2)\,2^{p/2}} x^{(p/2) - 1} e^{-x/2}, \qquad x > 0.$$

If $Z_1, \ldots, Z_p$ are independent standard Normal random variables then $\sum_{i=1}^{p} Z_i^2 \sim \chi^2_p$.


2.5 Bivariate Distributions

Given a pair of discrete random variables $X$ and $Y$, define the joint mass function by $f(x, y) = \mathbb{P}(X = x \text{ and } Y = y)$. From now on, we write $\mathbb{P}(X = x \text{ and } Y = y)$ as $\mathbb{P}(X = x, Y = y)$. We write $f$ as $f_{X,Y}$ when we want to be more explicit.


2.18 Example. Here is a bivariate distribution for two random variables $X$ and $Y$ each taking values 0 or 1:

         Y = 0    Y = 1
X = 0    1/9      2/9      1/3
X = 1    2/9      4/9      2/3
         1/3      2/3      1

Thus, $f(1, 1) = \mathbb{P}(X = 1, Y = 1) = 4/9$. ■


2.19 Definition. In the continuous case, we call a function $f(x, y)$ a PDF for the random variables $(X, Y)$ if

(i) $f(x, y) \ge 0$ for all $(x, y)$,

(ii) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$ and,

(iii) for any set $A \subset \mathbb{R} \times \mathbb{R}$, $\mathbb{P}((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy$.


In the discrete or continuous case we define the joint CDF as $F_{X,Y}(x, y) = \mathbb{P}(X \le x, Y \le y)$.

2.20 Example. Let $(X, Y)$ be uniform on the unit square. Then,

$$f(x, y) = \begin{cases} 1 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Find $\mathbb{P}(X < 1/2, Y < 1/2)$. The event $A = \{X < 1/2, Y < 1/2\}$ corresponds to a subset of the unit square. Integrating $f$ over this subset corresponds, in this case, to computing the area of the set $A$ which is 1/4. So, $\mathbb{P}(X < 1/2, Y < 1/2) = 1/4$. ■

2.21 Example. Let $(X, Y)$ have density

$$f(x, y) = \begin{cases} x + y & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$\int_0^1 \int_0^1 (x + y)\,dx\,dy = \int_0^1 \left[ \int_0^1 x\,dx \right] dy + \int_0^1 \left[ \int_0^1 y\,dx \right] dy = \int_0^1 \frac{1}{2}\,dy + \int_0^1 y\,dy = \frac{1}{2} + \frac{1}{2} = 1$$

which verifies that this is a PDF. ■


2.22 Example. If the distribution is defined over a non-rectangular region, then the calculations are a bit more complicated. Here is an example which I borrowed from DeGroot and Schervish (2002). Let $(X, Y)$ have density

$$f(x, y) = \begin{cases} c\,x^2 y & \text{if } x^2 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Note first that $-1 \le x \le 1$. Now let us find the value of $c$. The trick here is to be careful about the range of integration. We pick one variable, $x$ say, and let it range over its values. Then, for each fixed value of $x$, we let $y$ vary over its range, which is $x^2 \le y \le 1$. It may help if you look at Figure 2.5. Thus,

$$1 = \iint f(x, y)\,dy\,dx = c \int_{-1}^{1} \int_{x^2}^{1} x^2 y\,dy\,dx = c \int_{-1}^{1} x^2 \left[ \int_{x^2}^{1} y\,dy \right] dx = c \int_{-1}^{1} x^2 \frac{1 - x^4}{2}\,dx = \frac{4c}{21}.$$

Hence, $c = 21/4$. Now let us compute $\mathbb{P}(X \ge Y)$. This corresponds to the set $A = \{(x, y) : 0 \le x \le 1,\ x^2 \le y \le x\}$. (You can see this by drawing a diagram.) So,

$$\mathbb{P}(X \ge Y) = \frac{21}{4} \int_0^1 \int_{x^2}^{x} x^2 y\,dy\,dx = \frac{21}{4} \int_0^1 x^2 \left[ \int_{x^2}^{x} y\,dy \right] dx = \frac{21}{4} \int_0^1 x^2 \left( \frac{x^2 - x^4}{2} \right) dx = \frac{3}{20}.$$

FIGURE 2.5. The light shaded region is $x^2 \le y \le 1$. The density is positive over this region. The hatched region is the event $X \ge Y$ intersected with $x^2 \le y \le 1$.
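
The two integrals of Example 2.22 can be checked numerically by following the same recipe: let $x$ range over its values and, for each fixed $x$, integrate over the range of $y$. The sketch below uses a simple midpoint rule; the helper names `midpoint` and `integrate_x2y` and the grid sizes are illustrative assumptions.

```python
def midpoint(func, lo, hi, n):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(func(lo + (i + 0.5) * h) for i in range(n))


def integrate_x2y(x_lo, x_hi, y_lo, y_hi):
    """Integrate x^2 * y over x in [x_lo, x_hi] and y in [y_lo(x), y_hi(x)]."""
    def inner(x):
        # For a fixed x, y ranges over [y_lo(x), y_hi(x)].
        return midpoint(lambda y: x * x * y, y_lo(x), y_hi(x), 20)
    return midpoint(inner, x_lo, x_hi, 2000)


# The density must integrate to one over {x^2 <= y <= 1}, which fixes c.
c = 1.0 / integrate_x2y(-1.0, 1.0, lambda x: x * x, lambda x: 1.0)

# P(X >= Y): integrate c * x^2 * y over {0 <= x <= 1, x^2 <= y <= x}.
p = c * integrate_x2y(0.0, 1.0, lambda x: x * x, lambda x: x)

print(c, p)   # close to 21/4 and 3/20
```

Passing the limits of $y$ as functions of $x$ mirrors the care about the range of integration that the example emphasizes.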

2.6  Marginal  Distributions 

2.23 Definition. If $(X, Y)$ have joint distribution with mass function $f_{X,Y}$, then the marginal mass function for $X$ is defined by

$$f_X(x) = \mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y) = \sum_y f(x, y) \qquad (2.4)$$

and the marginal mass function for $Y$ is defined by

$$f_Y(y) = \mathbb{P}(Y = y) = \sum_x \mathbb{P}(X = x, Y = y) = \sum_x f(x, y). \qquad (2.5)$$

2.24 Example. Suppose that $f_{X,Y}$ is given in the table that follows. The marginal distribution for $X$ corresponds to the row totals and the marginal distribution for $Y$ corresponds to the column totals.

         Y = 0    Y = 1
X = 0    1/10     2/10     3/10
X = 1    3/10     4/10     7/10
         4/10     6/10     1

For example, $f_X(0) = 3/10$ and $f_X(1) = 7/10$. ■

2.25 Definition. For continuous random variables, the marginal densities are

$$f_X(x) = \int f(x, y)\,dy, \quad \text{and} \quad f_Y(y) = \int f(x, y)\,dx. \qquad (2.6)$$

The corresponding marginal distribution functions are denoted by $F_X$ and $F_Y$.
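
For a discrete joint distribution, the row and column totals of Example 2.24 can be computed mechanically by summing the joint mass function over the other coordinate. This is a small sketch; the dictionary layout keyed by $(x, y)$ and the helper name `marginal` are assumptions made for illustration.

```python
from fractions import Fraction

# Joint mass function of Example 2.24, keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(4, 10),
}


def marginal(joint, axis):
    """Sum the joint mass over the other coordinate (axis 0 -> X, 1 -> Y)."""
    out = {}
    for pair, mass in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + mass
    return out


f_X = marginal(joint, 0)   # row totals
f_Y = marginal(joint, 1)   # column totals
print(f_X, f_Y)
```

In the continuous case of Definition 2.25 the sums are replaced by integrals, but the idea of removing one variable by accumulating over it is the same.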


2.26 Example. Suppose that

$$f_{X,Y}(x, y) = e^{-(x + y)}$$

for $x, y \ge 0$. Then $f_X(x) = e^{-x} \int_0^{\infty} e^{-y}\,dy = e^{-x}$. ■


2.27 Example. Suppose that

$$f(x, y) = \begin{cases} x + y & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$f_Y(y) = \int_0^1 (x + y)\,dx = \int_0^1 x\,dx + \int_0^1 y\,dx = \frac{1}{2} + y.$$


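2.28 Example. Let $(X, Y)$ have density

$$f(x, y) = \begin{cases} \frac{21}{4} x^2 y & \text{if } x^2 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Thus,

$$f_X(x) = \int f(x, y)\,dy = \frac{21}{4} x^2 \int_{x^2}^{1} y\,dy = \frac{21}{8} x^2 (1 - x^4)$$

for $-1 \le x \le 1$ and $f_X(x) = 0$ otherwise.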


2.7  Independent  Random  Variables 


2.29 Definition. Two random variables $X$ and $Y$ are independent if, for every $A$ and $B$,

$$\mathbb{P}(X \in A, Y \in B) = \mathbb{P}(X \in A)\,\mathbb{P}(Y \in B) \qquad (2.7)$$

and we write $X \perp\!\!\!\perp Y$. Otherwise we say that $X$ and $Y$ are dependent and we write $X \not\perp\!\!\!\perp Y$.


In principle, to check whether $X$ and $Y$ are independent we need to check equation (2.7) for all subsets $A$ and $B$. Fortunately, we have the following result which we state for continuous random variables though it is true for discrete random variables too.

2.30 Theorem. Let $X$ and $Y$ have joint PDF $f_{X,Y}$. Then $X \perp\!\!\!\perp Y$ if and only if $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ for all values $x$ and $y$.$^5$


2.31 Example. Let $X$ and $Y$ have the following distribution:

         Y = 0    Y = 1
X = 0    1/4      1/4      1/2
X = 1    1/4      1/4      1/2
         1/2      1/2      1

Then, $f_X(0) = f_X(1) = 1/2$ and $f_Y(0) = f_Y(1) = 1/2$. $X$ and $Y$ are independent because $f_X(0) f_Y(0) = f(0, 0)$, $f_X(0) f_Y(1) = f(0, 1)$, $f_X(1) f_Y(0) = f(1, 0)$, $f_X(1) f_Y(1) = f(1, 1)$. Suppose instead that $X$ and $Y$ have the following distribution:

         Y = 0    Y = 1
X = 0    1/2      0        1/2
X = 1    0        1/2      1/2
         1/2      1/2      1

These are not independent because $f_X(0) f_Y(1) = (1/2)(1/2) = 1/4$ yet $f(0, 1) = 0$. ■
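
The cell-by-cell comparison used in Example 2.31 can be automated for any finite joint table. The sketch below is illustrative only: it assumes the table is stored as a dictionary keyed by $(x, y)$, and `is_independent` is a hypothetical helper that checks the factorization $f(x, y) = f_X(x) f_Y(y)$ exactly using fractions.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """True if f(x, y) = f_X(x) * f_Y(y) for every pair (x, y)."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    # Marginals: row totals and column totals of the joint table.
    f_X = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    f_Y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == f_X[x] * f_Y[y]
               for x, y in product(xs, ys))


quarter, half = Fraction(1, 4), Fraction(1, 2)
first = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
second = {(0, 0): half, (0, 1): 0, (1, 0): 0, (1, 1): half}

print(is_independent(first), is_independent(second))
```

A single failing cell is enough to rule out independence, which is why the second table fails even though its marginals match those of the first.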


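2.32 Example. Suppose that $X$ and $Y$ are independent and both have the same density

$$f(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let us find $\mathbb{P}(X + Y \le 1)$. Using independence, the joint density is

$$f(x, y) = f_X(x) f_Y(y) = \begin{cases} 4xy & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Now,

$$\mathbb{P}(X + Y \le 1) = \iint_{x + y \le 1} f(x, y)\,dy\,dx = 4 \int_0^1 x \left[ \int_0^{1 - x} y\,dy \right] dx = 4 \int_0^1 x \frac{(1 - x)^2}{2}\,dx = \frac{1}{6}.$$

$^5$ The statement is not rigorous because the density is defined only up to sets of measure 0.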


The  following  result  is  helpful  for  verifying  independence. 

2.33  Theorem.  Suppose  that  the  range  of  X  and  Y  is  a  (possibly  infinite) 
rectangle.  If  f(x,y)  =  g(x)h(y)  for  some  functions  g  and  h  (not  necessarily 
probability density functions) then $X$ and $Y$ are independent.


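2.34 Example. Let $X$ and $Y$ have density

$$f(x, y) = \begin{cases} 2e^{-(x + 2y)} & \text{if } x > 0 \text{ and } y > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The range of $X$ and $Y$ is the rectangle $(0, \infty) \times (0, \infty)$. We can write $f(x, y) = g(x) h(y)$ where $g(x) = 2e^{-x}$ and $h(y) = e^{-2y}$. Thus, $X \perp\!\!\!\perp Y$. ■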


2.8  Conditional  Distributions 

If $X$ and $Y$ are discrete, then we can compute the conditional distribution of $X$ given that we have observed $Y = y$. Specifically, $\mathbb{P}(X = x \mid Y = y) = \mathbb{P}(X = x, Y = y)/\mathbb{P}(Y = y)$. This leads us to define the conditional probability mass function as follows.


2.35 Definition. The conditional probability mass function is

$$f_{X|Y}(x \mid y) = \mathbb{P}(X = x \mid Y = y) = \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

if $f_Y(y) > 0$.


For continuous distributions we use the same definitions. 6 The interpretation
differs: in the discrete case, $f_{X|Y}(x|y)$ is $P(X = x \mid Y = y)$, but in the
continuous case, we must integrate to get a probability.

6 We are treading in deep water here. When we compute $P(X \in A \mid Y = y)$ in the
continuous case we are conditioning on the event {Y = y} which has probability 0. We


2.36 Definition. For continuous random variables, the conditional
probability density function is

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$$

assuming that $f_Y(y) > 0$. Then,

$$P(X \in A \mid Y = y) = \int_A f_{X|Y}(x|y)\,dx.$$


2.37 Example. Let X and Y have a joint uniform distribution on the unit
square. Thus, $f_{X|Y}(x|y) = 1$ for $0 \le x \le 1$ and 0 otherwise. Given Y = y, X
is Uniform(0, 1). We can write this as $X \mid Y = y \sim \text{Uniform}(0,1)$. ■

From the definition of the conditional density, we see that $f_{X,Y}(x,y) = f_{X|Y}(x|y) f_Y(y) = f_{Y|X}(y|x) f_X(x)$. This can sometimes be useful as in Example 2.39.


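2.38 Example. Let

$$f(x,y) = \begin{cases} x + y & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let us find $P(X < 1/4 \mid Y = 1/3)$. In Example 2.27 we saw that $f_Y(y) = y + (1/2)$. Hence,

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{x + y}{y + \frac{1}{2}}.$$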
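The remaining step is to integrate this conditional density over $0 \le x \le 1/4$ at $y = 1/3$. A minimal sketch with exact fractions follows; the helper name is hypothetical and the closed form inside it is simply the antiderivative of $(x + y)/(y + 1/2)$ in x.

```python
from fractions import Fraction

def conditional_prob_x_below(a, y):
    """P(X < a | Y = y) for f(x, y) = x + y on the unit square, with 0 <= a <= 1.

    Integrates f_{X|Y}(x|y) = (x + y) / (y + 1/2) over 0 <= x <= a in closed form.
    """
    a, y = Fraction(a), Fraction(y)
    numerator = a * a / 2 + a * y      # integral of (x + y) dx from 0 to a
    denominator = y + Fraction(1, 2)   # marginal density f_Y(y)
    return numerator / denominator

p = conditional_prob_x_below(Fraction(1, 4), Fraction(1, 3))
print(p)  # 11/80
```

Taking a = 1 returns 1 for any y, which is a quick check that the conditional density integrates to one.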


2.39  Example.  Suppose  that  X  ~  Uniform(0, 1).  After  obtaining  a  value  of 
X  we  generate  Y\X  =  x  ~  Uniform (x,  1).  What  is  the  marginal  distribution 


avoid  this  problem  by  defining  things  in  terms  of  the  pdf.  The  fact  that  this  leads  to 
a  well-defined  theory  is  proved  in  more  advanced  courses.  Here,  we  simply  take  it  as  a 
definition. 
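of Y? First note that

$$f_X(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

and

$$f_{Y|X}(y|x) = \begin{cases} \frac{1}{1-x} & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

So,

$$f_{X,Y}(x,y) = f_{Y|X}(y|x) f_X(x) = \begin{cases} \frac{1}{1-x} & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The marginal for Y is

$$f_Y(y) = \int_0^y f_{X,Y}(x,y)\,dx = \int_0^y \frac{dx}{1-x} = -\int_1^{1-y} \frac{du}{u} = -\log(1-y)$$

for $0 < y < 1$. ■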
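A simulation can serve as a rough check on this marginal. Integrating $f_Y(t) = -\log(1-t)$ from 0 to y gives the CDF $F_Y(y) = y + (1-y)\log(1-y)$, and the sketch below compares that with the empirical fraction of draws at or below 0.5. The function names, seed, and evaluation point are illustrative assumptions.

```python
import math
import random

def simulate_y(n_draws=200_000, seed=1):
    """Draw X ~ Uniform(0, 1), then Y | X = x ~ Uniform(x, 1); return the Y values."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n_draws):
        x = rng.random()
        ys.append(rng.uniform(x, 1.0))
    return ys

def cdf_y(y):
    """CDF implied by f_Y(y) = -log(1 - y), valid for 0 <= y < 1."""
    return y + (1.0 - y) * math.log(1.0 - y)

ys = simulate_y()
empirical = sum(v <= 0.5 for v in ys) / len(ys)
theoretical = cdf_y(0.5)  # 0.5 + 0.5 * log(0.5)
```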


2.40 Example. Consider the density in Example 2.28. Let's find $f_{Y|X}(y|x)$.
When X = x, y must satisfy $x^2 \le y \le 1$. Earlier, we saw that $f_X(x) = (21/8)x^2(1 - x^4)$. Hence, for $x^2 \le y \le 1$,

$$f_{Y|X}(y|x) = \frac{f(x,y)}{f_X(x)} = \frac{\frac{21}{4}x^2 y}{\frac{21}{8}x^2(1-x^4)} = \frac{2y}{1-x^4}.$$

Now let us compute $P(Y \ge 3/4 \mid X = 1/2)$. This can be done by first noting
that $f_{Y|X}(y|1/2) = 32y/15$. Thus,

$$P(Y \ge 3/4 \mid X = 1/2) = \int_{3/4}^1 f(y|1/2)\,dy = \int_{3/4}^1 \frac{32y}{15}\,dy = \frac{7}{15}.$$ ■


2.9  Multivariate  Distributions  and  iid  Samples 

Let $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are random variables. We call X a
random vector. Let $f(x_1, \ldots, x_n)$ denote the PDF. It is possible to define
their marginals, conditionals etc. much the same way as in the bivariate case.
We say that $X_1, \ldots, X_n$ are independent if, for every $A_1, \ldots, A_n$,

$$P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n P(X_i \in A_i). \qquad (2.8)$$

It suffices to check that $f(x_1, \ldots, x_n) = \prod_{i=1}^n f_{X_i}(x_i)$.


2.41 Definition. If $X_1, \ldots, X_n$ are independent and each has the same
marginal distribution with CDF F, we say that $X_1, \ldots, X_n$ are IID
(independent and identically distributed) and we write

$$X_1, \ldots, X_n \sim F.$$

If F has density f we also write $X_1, \ldots, X_n \sim f$. We also call $X_1, \ldots, X_n$
a random sample of size n from F.


Much of statistical theory and practice begins with IID observations and we
shall study this case in detail when we discuss statistics.

2.10  Two  Important  Multivariate  Distributions 

Multinomial. The multivariate version of a Binomial is called a Multinomial.
Consider drawing a ball from an urn which has balls with k different
colors labeled “color 1, color 2, ..., color k.” Let $p = (p_1, \ldots, p_k)$ where
$p_j \ge 0$ and $\sum_{j=1}^k p_j = 1$ and suppose that $p_j$ is the probability of drawing
a ball of color j. Draw n times (independent draws with replacement) and
let $X = (X_1, \ldots, X_k)$ where $X_j$ is the number of times that color j appears.
Hence, $n = \sum_{j=1}^k X_j$. We say that X has a Multinomial(n, p) distribution,
written $X \sim \text{Multinomial}(n,p)$. The probability function is

$$f(x) = \binom{n}{x_1 \ldots x_k} p_1^{x_1} \cdots p_k^{x_k} \qquad (2.9)$$

where

$$\binom{n}{x_1 \ldots x_k} = \frac{n!}{x_1! \cdots x_k!}.$$

2.42 Lemma. Suppose that $X \sim \text{Multinomial}(n,p)$ where $X = (X_1, \ldots, X_k)$
and $p = (p_1, \ldots, p_k)$. The marginal distribution of $X_j$ is Binomial$(n, p_j)$.
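The urn description and the lemma can be illustrated by simulation. The sketch below, under assumed illustrative values n = 10 and p = (0.2, 0.3, 0.5), draws many Multinomial vectors by repeated sampling with replacement and checks that one component has roughly the mean $np_j$ and variance $np_j(1-p_j)$ of a Binomial$(n, p_j)$. The helper name and seed are hypothetical.

```python
import random

def multinomial_draw(n, p, rng):
    """One Multinomial(n, p) draw: counts of each color in n draws with replacement."""
    counts = [0] * len(p)
    for color in rng.choices(range(len(p)), weights=p, k=n):
        counts[color] += 1
    return counts

rng = random.Random(2)
n, p = 10, [0.2, 0.3, 0.5]
draws = [multinomial_draw(n, p, rng) for _ in range(50_000)]

j = 1  # look at the second color (index 1), an arbitrary choice
xj = [d[j] for d in draws]
mean_xj = sum(xj) / len(xj)
var_xj = sum((v - mean_xj) ** 2 for v in xj) / len(xj)
# A Binomial(n, p_j) variable has mean n*p_j and variance n*p_j*(1 - p_j).
```

Matching the first two moments does not prove the lemma, but it is a cheap consistency check.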

Multivariate Normal. The univariate Normal has two parameters, $\mu$
and $\sigma$. In the multivariate version, $\mu$ is a vector and $\sigma$ is replaced by a matrix
$\Sigma$. To begin, let

$$Z = (Z_1, \ldots, Z_k)^T$$

where $Z_1, \ldots, Z_k \sim N(0,1)$ are independent. The density of Z is 7

$$f(z) = \prod_{i=1}^k f(z_i) = \frac{1}{(2\pi)^{k/2}} \exp\left\{ -\frac{1}{2} \sum_{j=1}^k z_j^2 \right\} = \frac{1}{(2\pi)^{k/2}} \exp\left\{ -\frac{1}{2} z^T z \right\}.$$

We say that Z has a standard multivariate Normal distribution, written $Z \sim N(0, I)$, where it is understood that 0 represents a vector of k zeroes and I is
the k x k identity matrix.

More generally, a vector X has a multivariate Normal distribution, denoted
by $X \sim N(\mu, \Sigma)$, if it has density 8

$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

where $|\Sigma|$ denotes the determinant of $\Sigma$, $\mu$ is a vector of length k and $\Sigma$ is a
k x k symmetric, positive definite matrix. 9 Setting $\mu = 0$ and $\Sigma = I$ gives
back the standard Normal.

Since $\Sigma$ is symmetric and positive definite, it can be shown that there exists
a matrix $\Sigma^{1/2}$, called the square root of $\Sigma$, with the following properties:
(i) $\Sigma^{1/2}$ is symmetric, (ii) $\Sigma = \Sigma^{1/2}\Sigma^{1/2}$ and (iii) $\Sigma^{1/2}\Sigma^{-1/2} = \Sigma^{-1/2}\Sigma^{1/2} = I$
where $\Sigma^{-1/2} = (\Sigma^{1/2})^{-1}$.

2.43 Theorem. If $Z \sim N(0, I)$ and $X = \mu + \Sigma^{1/2} Z$ then $X \sim N(\mu, \Sigma)$.
Conversely, if $X \sim N(\mu, \Sigma)$, then $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$.

Suppose we partition a random Normal vector X as $X = (X_a, X_b)$. We can
similarly partition $\mu = (\mu_a, \mu_b)$ and

$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}.$$

2.44 Theorem. Let $X \sim N(\mu, \Sigma)$. Then:

(1) The marginal distribution of $X_a$ is $X_a \sim N(\mu_a, \Sigma_{aa})$.

(2) The conditional distribution of $X_b$ given $X_a = x_a$ is

$$X_b \mid X_a = x_a \sim N\left( \mu_b + \Sigma_{ba}\Sigma_{aa}^{-1}(x_a - \mu_a),\ \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \right).$$

(3) If a is a vector then $a^T X \sim N(a^T \mu, a^T \Sigma a)$.

(4) $V = (X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_k$.


7 If a and b are vectors then $a^T b = \sum_{i=1}^k a_i b_i$.

8 $\Sigma^{-1}$ is the inverse of the matrix $\Sigma$.

9 A matrix $\Sigma$ is positive definite if, for all nonzero vectors x, $x^T \Sigma x > 0$.
2.11 Transformations of Random Variables


Suppose that X is a random variable with PDF $f_X$ and CDF $F_X$. Let Y = r(X)
be a function of X, for example, $Y = X^2$ or $Y = e^X$. We call Y = r(X) a
transformation of X. How do we compute the PDF and CDF of Y? In the
discrete case, the answer is easy. The mass function of Y is given by

$$f_Y(y) = P(Y = y) = P(r(X) = y) = P(\{x;\ r(x) = y\}) = P(X \in r^{-1}(y)).$$


2.45 Example. Suppose that P(X = -1) = P(X = 1) = 1/4 and P(X = 0) =
1/2. Let $Y = X^2$. Then, P(Y = 0) = P(X = 0) = 1/2 and P(Y = 1) = P(X = 1) + P(X = -1) = 1/2. Summarizing:

  x    f_X(x)          y    f_Y(y)
 -1     1/4            0     1/2
  0     1/2            1     1/2
  1     1/4


Y  takes  fewer  values  than  X  because  the  transformation  is  not  one-to-one.  ■ 


The continuous case is harder. There are three steps for finding $f_Y$:


Three Steps for Transformations

1. For each y, find the set $A_y = \{x : r(x) \le y\}$.

2. Find the CDF

$$F_Y(y) = P(Y \le y) = P(r(X) \le y) = P(\{x;\ r(x) \le y\}) = \int_{A_y} f_X(x)\,dx. \qquad (2.11)$$

3. The PDF is $f_Y(y) = F'_Y(y)$.


2.46 Example. Let $f_X(x) = e^{-x}$ for $x > 0$. Hence, $F_X(x) = \int_0^x f_X(s)\,ds = 1 - e^{-x}$. Let $Y = r(X) = \log X$. Then, $A_y = \{x : x \le e^y\}$ and

$$F_Y(y) = P(Y \le y) = P(\log X \le y) = P(X \le e^y) = F_X(e^y) = 1 - e^{-e^y}.$$

Therefore, $f_Y(y) = e^y e^{-e^y}$ for $y \in \mathbb{R}$. ■
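The three steps can be checked numerically for this example. The sketch below codes the derived CDF and PDF of Y = log X and compares the CDF at y = 0 with the empirical fraction from simulated draws. It assumes `random.expovariate(1.0)` as the source of Exp(1) draws; the function names and seed are illustrative.

```python
import math
import random

def cdf_log_exponential(y):
    """F_Y(y) = 1 - exp(-e^y) for Y = log X with X having density e^{-x} on x > 0."""
    return 1.0 - math.exp(-math.exp(y))

def pdf_log_exponential(y):
    """f_Y(y) = e^y * exp(-e^y), the derivative of F_Y."""
    return math.exp(y) * math.exp(-math.exp(y))

rng = random.Random(3)
# expovariate(1.0) draws from the density e^{-x} on x > 0.
ys = [math.log(rng.expovariate(1.0)) for _ in range(100_000)]

y0 = 0.0
empirical = sum(v <= y0 for v in ys) / len(ys)
theoretical = cdf_log_exponential(y0)  # 1 - e^{-1}
```

A central difference of `cdf_log_exponential` at any point should also agree with `pdf_log_exponential`, which is step 3 in numerical form.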
2.47 Example. Let $X \sim \text{Uniform}(-1, 3)$. Find the PDF of $Y = X^2$. The
density of X is

$$f_X(x) = \begin{cases} 1/4 & \text{if } -1 < x < 3 \\ 0 & \text{otherwise.} \end{cases}$$

Y can only take values in (0, 9). Consider two cases: (i) $0 < y < 1$ and (ii) $1 \le y < 9$.
For case (i), $A_y = [-\sqrt{y}, \sqrt{y}]$ and $F_Y(y) = \int_{A_y} f_X(x)\,dx = (1/2)\sqrt{y}$.
For case (ii), $A_y = [-1, \sqrt{y}]$ and $F_Y(y) = \int_{A_y} f_X(x)\,dx = (1/4)(\sqrt{y} + 1)$.
Differentiating F we get

$$f_Y(y) = \begin{cases} \frac{1}{4\sqrt{y}} & \text{if } 0 < y < 1 \\ \frac{1}{8\sqrt{y}} & \text{if } 1 < y < 9 \\ 0 & \text{otherwise.} \end{cases}$$ ■


When r is strictly monotone increasing or strictly monotone decreasing then
r has an inverse $s = r^{-1}$ and in this case one can show that

$$f_Y(y) = f_X(s(y)) \left| \frac{ds(y)}{dy} \right|. \qquad (2.12)$$


2.12  Transformations  of  Several  Random  Variables 

In some cases we are interested in transformations of several random variables.
For example, if X and Y are given random variables, we might want to know
the distribution of X/Y, X + Y, max{X, Y} or min{X, Y}. Let Z = r(X, Y)
be the function of interest. The steps for finding $f_Z$ are the same as before:


Three Steps for Transformations

1. For each z, find the set $A_z = \{(x,y) : r(x,y) \le z\}$.

2. Find the CDF

$$F_Z(z) = P(Z \le z) = P(r(X,Y) \le z) = P(\{(x,y);\ r(x,y) \le z\}) = \iint_{A_z} f_{X,Y}(x,y)\,dx\,dy.$$

3. Then $f_Z(z) = F'_Z(z)$.
2.48 Example. Let $X_1, X_2 \sim \text{Uniform}(0,1)$ be independent. Find the density
of $Y = X_1 + X_2$. The joint density of $(X_1, X_2)$ is

$$f(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let $r(x_1, x_2) = x_1 + x_2$. Now,

$$F_Y(y) = P(Y \le y) = P(r(X_1, X_2) \le y) = P(\{(x_1,x_2) : r(x_1,x_2) \le y\}) = \iint_{A_y} f(x_1,x_2)\,dx_1\,dx_2.$$

Now comes the hard part: finding $A_y$. First suppose that $0 < y \le 1$. Then $A_y$
is the triangle with vertices (0, 0), (y, 0) and (0, y). See Figure 2.6. In this case,
$\iint_{A_y} f(x_1,x_2)\,dx_1\,dx_2$ is the area of this triangle which is $y^2/2$. If $1 < y < 2$,
then $A_y$ is everything in the unit square except the triangle with vertices
(1, y - 1), (1, 1), (y - 1, 1). This set has area $1 - (2 - y)^2/2$. Therefore,

$$F_Y(y) = \begin{cases} 0 & y < 0 \\ \frac{y^2}{2} & 0 \le y < 1 \\ 1 - \frac{(2-y)^2}{2} & 1 \le y < 2 \\ 1 & y \ge 2. \end{cases}$$

By differentiation, the PDF is

$$f_Y(y) = \begin{cases} y & 0 \le y \le 1 \\ 2 - y & 1 \le y \le 2 \\ 0 & \text{otherwise.} \end{cases}$$
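The piecewise CDF of the sum can be compared with simulated sums of two uniforms. This is a rough sketch; the function name, seed, and the three evaluation points are illustrative choices.

```python
import random

def cdf_sum_two_uniforms(y):
    """Piecewise CDF of Y = X1 + X2 for independent Uniform(0, 1) variables."""
    if y < 0:
        return 0.0
    if y < 1:
        return y * y / 2
    if y < 2:
        return 1 - (2 - y) ** 2 / 2
    return 1.0

rng = random.Random(4)
sums = [rng.random() + rng.random() for _ in range(100_000)]

checks = {}
for y in (0.5, 1.0, 1.5):
    empirical = sum(s <= y for s in sums) / len(sums)
    checks[y] = (empirical, cdf_sum_two_uniforms(y))
```

Each pair in `checks` holds the empirical and the derived CDF value at one point; they should agree to about two decimal places.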


2.13  Appendix 


Recall that a probability measure P is defined on a $\sigma$-field $\mathcal{A}$ of a sample
space $\Omega$. A random variable X is a measurable map $X : \Omega \to \mathbb{R}$. Measurable
means that, for every x, $\{\omega : X(\omega) \le x\} \in \mathcal{A}$.

2.14  Exercises 


1. Show that

$$P(X = x) = F(x^+) - F(x^-).$$
[Figure 2.6: two panels showing the unit square; the left panel is the case $0 \le y < 1$ and the right panel is the case $1 \le y \le 2$.]

FIGURE 2.6. The set $A_y$ for example 2.48. $A_y$ consists of all points $(x_1, x_2)$ in the
square below the line $x_2 = y - x_1$.


2. Let X be such that P(X = 2) = P(X = 3) = 1/10 and P(X = 5) = 8/10.
Plot the CDF F. Use F to find $P(2 < X \le 4.8)$ and $P(2 \le X \le 4.8)$.

3.  Prove  Lemma  2.15. 


4. Let X have probability density function

$$f_X(x) = \begin{cases} 1/4 & 0 < x < 1 \\ 3/8 & 3 < x < 5 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the cumulative distribution function of X.

(b) Let Y = 1/X. Find the probability density function $f_Y(y)$ for Y.

Hint: Consider three cases: $\frac{1}{5} \le y \le \frac{1}{3}$, $\frac{1}{3} \le y \le 1$, and $y \ge 1$.

5. Let X and Y be discrete random variables. Show that X and Y are
independent if and only if $f_{X,Y}(x,y) = f_X(x) f_Y(y)$ for all x and y.

6. Let X have distribution F and density function f and let A be a subset
of the real line. Let $I_A(x)$ be the indicator function for A:

$$I_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A. \end{cases}$$


Let  Y  =  Ia(X).  Find  an  expression  for  the  cumulative  distribution  of 
Y.  (Hint:  first  find  the  probability  mass  function  for  Y.) 


7. Let X and Y be independent and suppose that each has a Uniform(0, 1)
distribution. Let Z = min{X, Y}. Find the density $f_Z(z)$ for Z. Hint:
It might be easier to first find P(Z > z).

8. Let X have CDF F. Find the CDF of $X^+ = \max\{0, X\}$.

9. Let $X \sim \text{Exp}(\beta)$. Find F(x) and $F^{-1}(q)$.

10.  Let  X  and  Y  be  independent.  Show  that  g(X)  is  independent  of  h(Y) 
where  g  and  h  are  functions. 

11.  Suppose  we  toss  a  coin  once  and  let  p  be  the  probability  of  heads.  Let 
X  denote  the  number  of  heads  and  let  Y  denote  the  number  of  tails. 

(a)  Prove  that  X  and  Y  are  dependent. 

(b)  Let  N  ~  Poisson(A)  and  suppose  we  toss  a  coin  N  times.  Let  X  and 
Y  be  the  number  of  heads  and  tails.  Show  that  X  and  Y  are  independent. 

12.  Prove  Theorem  2.33. 

13. Let $X \sim N(0, 1)$ and let $Y = e^X$.

(a) Find the PDF for Y. Plot it.

(b) (Computer Experiment.) Generate a vector $x = (x_1, \ldots, x_{10,000})$ consisting
of 10,000 random standard Normals. Let $y = (y_1, \ldots, y_{10,000})$
where $y_i = e^{x_i}$. Draw a histogram of y and compare it to the PDF you
found in part (a).

14. Let (X, Y) be uniformly distributed on the unit disk $\{(x, y) : x^2 + y^2 \le 1\}$.
Let $R = \sqrt{X^2 + Y^2}$. Find the CDF and PDF of R.

15. (A universal random number generator.) Let X have a continuous, strictly
increasing CDF F. Let Y = F(X). Find the density of Y. This is called
the probability integral transform. Now let $U \sim \text{Uniform}(0,1)$ and let
$X = F^{-1}(U)$. Show that $X \sim F$. Now write a program that takes
Uniform(0,1) random variables and generates random variables from
an Exponential($\beta$) distribution.

16. Let $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$ and assume that X and Y are
independent. Show that the distribution of X given that X + Y = n is
Binomial$(n, \pi)$ where $\pi = \lambda/(\lambda + \mu)$.

Hint 1: You may use the following fact: If $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$,
and X and Y are independent, then $X + Y \sim \text{Poisson}(\mu + \lambda)$.

Hint 2: Note that {X = x, X + Y = n} = {X = x, Y = n - x}.


17. Let

$$f_{X,Y}(x,y) = \begin{cases} c(x + y^2) & 0 \le x \le 1 \text{ and } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Find $P(X < \frac{1}{2} \mid Y = \frac{1}{2})$.

18. Let $X \sim N(3, 16)$. Solve the following using the Normal table and using
a computer package.


(a)  Find  P(X  <  7). 

(b)  Find  P(X  >  -2). 

(c)  Find  x  such  that  P(X  >  x)  =  .05. 

(d)  Find  P(0  <  X  <  4). 

(e)  Find  x  such  that  P(|X|  >  |x|)  =  .05. 

19.  Prove  formula  (2.12). 

20. Let $X, Y \sim \text{Uniform}(0, 1)$ be independent. Find the PDF for X - Y and
X/Y.

21. Let $X_1, \ldots, X_n \sim \text{Exp}(\beta)$ be IID. Let $Y = \max\{X_1, \ldots, X_n\}$. Find the
PDF of Y. Hint: $Y \le y$ if and only if $X_i \le y$ for $i = 1, \ldots, n$.


3 

Expectation 


3.1  Expectation  of  a  Random  Variable 

The  mean,  or  expectation,  of  a  random  variable  X  is  the  average  value  of  X. 


3.1 Definition. The expected value, or mean, or first moment, of
X is defined to be

$$E(X) = \int x\,dF(x) = \begin{cases} \sum_x x f(x) & \text{if X is discrete} \\ \int x f(x)\,dx & \text{if X is continuous} \end{cases}$$

assuming that the sum (or integral) is well defined. We use the following
notation to denote the expected value of X:

$$E(X) = EX = \int x\,dF(x) = \mu = \mu_X.$$


The expectation is a one-number summary of the distribution. Think of
E(X) as the average $\sum_{i=1}^n X_i/n$ of a large number of IID draws $X_1, \ldots, X_n$.
The fact that $E(X) \approx \sum_{i=1}^n X_i/n$ is actually more than a heuristic; it is a
theorem called the law of large numbers that we will discuss in Chapter 5.

The notation $\int x\,dF(x)$ deserves some comment. We use it merely as a
convenient unifying notation so we don’t have to write $\sum_x x f(x)$ for discrete
random variables and $\int x f(x)\,dx$ for continuous random variables, but you
should be aware that $\int x\,dF(x)$ has a precise meaning that is discussed in real
analysis courses.

To ensure that E(X) is well defined, we say that E(X) exists if $\int |x|\,dF_X(x) < \infty$.
Otherwise we say that the expectation does not exist.


3.2 Example. Let $X \sim \text{Bernoulli}(p)$. Then $E(X) = \sum_{x=0}^1 x f(x) = (0 \times (1 - p)) + (1 \times p) = p$. ■

3.3 Example. Flip a fair coin two times. Let X be the number of heads. Then,
$E(X) = \int x\,dF_X(x) = \sum_x x f_X(x) = (0 \times f(0)) + (1 \times f(1)) + (2 \times f(2)) = (0 \times (1/4)) + (1 \times (1/2)) + (2 \times (1/4)) = 1$. ■

3.4 Example. Let $X \sim \text{Uniform}(-1, 3)$. Then, $E(X) = \int x\,dF_X(x) = \int x f_X(x)\,dx = \frac{1}{4}\int_{-1}^3 x\,dx = 1$. ■


3.5 Example. Recall that a random variable has a Cauchy distribution if it
has density $f_X(x) = \{\pi(1 + x^2)\}^{-1}$. Using integration by parts (set u = x
and $v = \tan^{-1} x$),

$$\int |x|\,dF(x) = \frac{2}{\pi} \int_0^\infty \frac{x\,dx}{1 + x^2} = \left[ x \tan^{-1}(x) \right]_0^\infty - \int_0^\infty \tan^{-1} x\,dx = \infty$$

so the mean does not exist. If you simulate a Cauchy distribution many times
and take the average, you will see that the average never settles down. This
is because the Cauchy has thick tails and hence extreme observations are
common. ■
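The simulation described in this example can be sketched as follows. The draws use the inverse CDF of the standard Cauchy, $F^{-1}(u) = \tan(\pi(u - 1/2))$, which is a standard fact rather than something stated in the text; the function names, seed, and checkpoints are illustrative.

```python
import math
import random

def cauchy_draws(n, seed=5):
    """Standard Cauchy draws via the inverse CDF: tan(pi * (U - 1/2)), U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def running_means(values):
    """Return the averages of the first 1, 2, ..., n values."""
    means, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        means.append(total / i)
    return means

draws = cauchy_draws(100_000)
means = running_means(draws)
for n in (100, 1_000, 10_000, 100_000):
    print(n, means[n - 1])  # these averages typically keep jumping around
```

By contrast the sample median of the same draws is stable near 0, since the median of the Cauchy is well defined even though the mean is not.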


From  now  on,  whenever  we  discuss  expectations,  we  implicitly  assume  that 
they  exist. 

Let Y = r(X). How do we compute E(Y)? One way is to find $f_Y(y)$ and
then compute $E(Y) = \int y f_Y(y)\,dy$. But there is an easier way.


3.6 Theorem (The Rule of the Lazy Statistician). Let Y = r(X). Then

$$E(Y) = E(r(X)) = \int r(x)\,dF_X(x).$$


This result makes intuitive sense. Think of playing a game where we draw
X at random and then I pay you Y = r(X). Your average income is r(x) times
the chance that X = x, summed (or integrated) over all values of x. Here is
a special case. Let A be an event and let $r(x) = I_A(x)$ where $I_A(x) = 1$ if
$x \in A$ and $I_A(x) = 0$ if $x \notin A$. Then

$$E(I_A(X)) = \int I_A(x) f_X(x)\,dx = \int_A f_X(x)\,dx = P(X \in A).$$


In  other  words,  probability  is  a  special  case  of  expectation. 


3.7 Example. Let $X \sim \text{Unif}(0, 1)$. Let $Y = r(X) = e^X$. Then,

$$E(Y) = \int_0^1 e^x f(x)\,dx = \int_0^1 e^x\,dx = e - 1.$$

Alternatively, you could find $f_Y(y)$ which turns out to be $f_Y(y) = 1/y$ for
$1 < y < e$. Then, $E(Y) = \int_1^e y f(y)\,dy = e - 1$. ■


3.8 Example. Take a stick of unit length and break it at random. Let Y be
the length of the longer piece. What is the mean of Y? If X is the break point
then $X \sim \text{Unif}(0, 1)$ and $Y = r(X) = \max\{X, 1 - X\}$. Thus, $r(x) = 1 - x$
when $0 < x < 1/2$ and $r(x) = x$ when $1/2 \le x < 1$. Hence,

$$E(Y) = \int r(x)\,dF(x) = \int_0^{1/2} (1 - x)\,dx + \int_{1/2}^1 x\,dx = \frac{3}{4}.$$ ■
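The rule of the lazy statistician says we can average r(X) directly without first finding the density of Y. A minimal simulation sketch of the stick-breaking example follows; the function name, number of draws, and seed are illustrative assumptions.

```python
import random

def longer_piece_mean(n_draws=200_000, seed=6):
    """Average of r(X) = max(X, 1 - X) over draws X ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.random()
        total += max(x, 1.0 - x)  # apply r directly; no need for the density of Y
    return total / n_draws

estimate = longer_piece_mean()
# Each of the two integrals, of (1 - x) over (0, 1/2) and of x over (1/2, 1), equals 3/8.
```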


Functions of several variables are handled in a similar way. If Z = r(X, Y)
then

$$E(Z) = E(r(X,Y)) = \iint r(x,y)\,dF(x,y). \qquad (3.4)$$


3.9 Example. Let (X, Y) have a jointly uniform distribution on the unit
square. Let $Z = r(X, Y) = X^2 + Y^2$. Then,

$$E(Z) = \iint r(x,y)\,dF(x,y) = \int_0^1 \int_0^1 (x^2 + y^2)\,dx\,dy = \int_0^1 x^2\,dx + \int_0^1 y^2\,dy = \frac{2}{3}.$$ ■


The kth moment of X is defined to be $E(X^k)$ assuming that $E(|X|^k) < \infty$.


3.10  Theorem.  If  the  kth  moment  exists  and  if  j  <  k  then  the  jth  moment 
exists. 


Proof. We have

$$E|X|^j = \int_{-\infty}^{\infty} |x|^j f_X(x)\,dx = \int_{|x| \le 1} |x|^j f_X(x)\,dx + \int_{|x| > 1} |x|^j f_X(x)\,dx$$

$$\le \int_{|x| \le 1} f_X(x)\,dx + \int_{|x| > 1} |x|^k f_X(x)\,dx \le 1 + E(|X|^k) < \infty.$$ ■


The kth central moment is defined to be $E((X - \mu)^k)$.


3.2  Properties  of  Expectations 


3.11 Theorem. If $X_1, \ldots, X_n$ are random variables and $a_1, \ldots, a_n$ are constants,
then

$$E\left( \sum_i a_i X_i \right) = \sum_i a_i E(X_i). \qquad (3.5)$$

3.12 Example. Let $X \sim \text{Binomial}(n, p)$. What is the mean of X? We could
try to appeal to the definition:

$$E(X) = \int x\,dF_X(x) = \sum_x x f_X(x) = \sum_{x=0}^n x \binom{n}{x} p^x (1 - p)^{n-x}$$

but this is not an easy sum to evaluate. Instead, note that $X = \sum_{i=1}^n X_i$
where $X_i = 1$ if the ith toss is heads and $X_i = 0$ otherwise. Then $E(X_i) = (p \times 1) + ((1 - p) \times 0) = p$ and $E(X) = E(\sum_i X_i) = \sum_i E(X_i) = np$. ■

3.13 Theorem. Let $X_1, \ldots, X_n$ be independent random variables. Then,

$$E\left( \prod_{i=1}^n X_i \right) = \prod_i E(X_i).$$


Notice  that  the  summation  rule  does  not  require  independence  but  the 
multiplication  rule  does. 


3.3  Variance  and  Covariance 

The  variance  measures  the  “spread”  of  a  distribution.  1 


1 We can’t use $E(X - \mu)$ as a measure of spread since $E(X - \mu) = E(X) - \mu = \mu - \mu = 0$.
We can and sometimes do use $E|X - \mu|$ as a measure of spread but more often we use the
variance.


3.14 Definition. Let X be a random variable with mean $\mu$. The variance
of X, denoted by $\sigma^2$ or $\sigma_X^2$ or V(X) or VX, is defined by

$$\sigma^2 = E(X - \mu)^2 = \int (x - \mu)^2\,dF(x) \qquad (3.7)$$

assuming this expectation exists. The standard deviation is
$\text{sd}(X) = \sqrt{V(X)}$ and is also denoted by $\sigma$ and $\sigma_X$.


3.15 Theorem. Assuming the variance is well defined, it has the following
properties:

1. $V(X) = E(X^2) - \mu^2$.

2. If a and b are constants then $V(aX + b) = a^2 V(X)$.

3. If $X_1, \ldots, X_n$ are independent and $a_1, \ldots, a_n$ are constants, then

$$V\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2 V(X_i). \qquad (3.8)$$


3.16 Example. Let $X \sim \text{Binomial}(n, p)$. We write $X = \sum_i X_i$ where $X_i = 1$
if toss i is heads and $X_i = 0$ otherwise. The random variables $X_i$ are
independent. Also, $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. Recall
that

$$E(X_i) = (p \times 1) + ((1 - p) \times 0) = p.$$

Now,

$$E(X_i^2) = (p \times 1^2) + ((1 - p) \times 0^2) = p.$$

Therefore, $V(X_i) = E(X_i^2) - p^2 = p - p^2 = p(1 - p)$. Finally, $V(X) = V(\sum_i X_i) = \sum_i V(X_i) = \sum_i p(1 - p) = np(1 - p)$. Notice that V(X) = 0
if p = 1 or p = 0. Make sure you see why this makes intuitive sense. ■

If $X_1, \ldots, X_n$ are random variables then we define the sample mean to be

$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \qquad (3.9)$$

and the sample variance to be

$$S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n \left( X_i - \overline{X}_n \right)^2. \qquad (3.10)$$
3.17 Theorem. Let $X_1, \ldots, X_n$ be IID and let $\mu = E(X_i)$, $\sigma^2 = V(X_i)$. Then

$$E(\overline{X}_n) = \mu, \quad V(\overline{X}_n) = \frac{\sigma^2}{n} \quad \text{and} \quad E(S_n^2) = \sigma^2.$$
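To make the sample mean and sample variance concrete, here is a small sketch that codes formulas (3.9) and (3.10) directly and averages $S_n^2$ over many samples, which should land near $\sigma^2$. The Uniform(0, 1) population, sample size n = 5, repetition count, and seed are illustrative assumptions; `statistics.variance` is used only as a cross-check because it also divides by n - 1.

```python
import random
import statistics

def sample_mean(xs):
    """Sample mean, formula (3.9)."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the n - 1 divisor, formula (3.10)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
n, reps = 5, 40_000
total = 0.0
for _ in range(reps):
    sample = [rng.random() for _ in range(n)]
    total += sample_variance(sample)
avg_s2 = total / reps

sigma2 = 1 / 12  # variance of Uniform(0, 1)
cross_check = statistics.variance([1.0, 2.0, 3.0, 4.0])  # also uses n - 1
```

Dividing by n instead of n - 1 would make the long-run average come out near $(n-1)\sigma^2/n$ rather than $\sigma^2$.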

If X and Y are random variables, then the covariance and correlation between
X and Y measure how strong the linear relationship is between X and Y.


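3.18 Definition. Let X and Y be random variables with means $\mu_X$ and
$\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. Define the covariance between
X and Y by

$$\text{Cov}(X, Y) = E\left[ (X - \mu_X)(Y - \mu_Y) \right] \qquad (3.11)$$

and the correlation by

$$\rho = \rho_{X,Y} = \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$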


3.19 Theorem. The covariance satisfies:

$$\text{Cov}(X, Y) = E(XY) - E(X)E(Y).$$

The correlation satisfies:

$$-1 \le \rho(X, Y) \le 1.$$

If Y = aX + b for some constants a and b then $\rho(X,Y) = 1$ if $a > 0$ and
$\rho(X,Y) = -1$ if $a < 0$. If X and Y are independent, then $\text{Cov}(X, Y) = \rho = 0$.
The converse is not true in general.
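A small exact example may help with the last statement. Take X uniform on {-1, 0, 1} and $Y = X^2$ (an illustrative choice, not from the text): the covariance is zero, yet Y is a function of X, so the two are dependent. The sketch below verifies both facts with exact fractions.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2 (an illustrative choice, not from the text).
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
joint = {(x, x * x): p for x, p in pmf_x.items()}

def expect(g):
    """E[g(X, Y)] under the joint mass function."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require f(0, 0) = f_X(0) * f_Y(0).
p_x0 = pmf_x[0]
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_joint_00 = joint.get((0, 0), Fraction(0))
```

Here `cov` is exactly 0 while `p_joint_00` is 1/3 and `p_x0 * p_y0` is 1/9, so the factorization required for independence fails.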

3.20 Theorem. $V(X + Y) = V(X) + V(Y) + 2\text{Cov}(X, Y)$ and $V(X - Y) = V(X) + V(Y) - 2\text{Cov}(X, Y)$. More generally, for random variables $X_1, \ldots, X_n$,

$$V\left( \sum_i a_i X_i \right) = \sum_i a_i^2 V(X_i) + 2 \sum\sum_{i<j} a_i a_j \text{Cov}(X_i, X_j).$$

3.4  Expectation  and  Variance  of  Important  Random 
Variables 

Here  we  record  the  expectation  of  some  important  random  variables: 


Distribution                  Mean              Variance
Point mass at a               a                 0
Bernoulli(p)                  p                 p(1 - p)
Binomial(n, p)                np                np(1 - p)
Geometric(p)                  1/p               (1 - p)/p^2
Poisson(λ)                    λ                 λ
Uniform(a, b)                 (a + b)/2         (b - a)^2/12
Normal(μ, σ^2)                μ                 σ^2
Exponential(β)                β                 β^2
Gamma(α, β)                   αβ                αβ^2
Beta(α, β)                    α/(α + β)         αβ/((α + β)^2 (α + β + 1))
t_ν                           0 (if ν > 1)      ν/(ν - 2) (if ν > 2)
χ^2_p                         p                 2p
Multinomial(n, p)             np                see below
Multivariate Normal(μ, Σ)     μ                 Σ

We  derived  E(X)  and  V(X)  for  the  Binomial  in  the  previous  section.  The 
calculations  for  some  of  the  others  are  in  the  exercises. 

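The last two entries in the table are multivariate models which involve a
random vector X of the form

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix}.$$

The mean of a random vector X is defined by

$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_k) \end{pmatrix}.$$

The variance-covariance matrix $\Sigma$ is defined to be

$$V(X) = \begin{pmatrix} V(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_k) \\ \text{Cov}(X_2, X_1) & V(X_2) & \cdots & \text{Cov}(X_2, X_k) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_k, X_1) & \text{Cov}(X_k, X_2) & \cdots & V(X_k) \end{pmatrix}.$$

If $X \sim \text{Multinomial}(n, p)$ then $E(X) = np = n(p_1, \ldots, p_k)$ and

$$V(X) = \begin{pmatrix} np_1(1 - p_1) & -np_1p_2 & \cdots & -np_1p_k \\ -np_2p_1 & np_2(1 - p_2) & \cdots & -np_2p_k \\ \vdots & \vdots & \ddots & \vdots \\ -np_kp_1 & -np_kp_2 & \cdots & np_k(1 - p_k) \end{pmatrix}.$$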
To see this, note that the marginal distribution of any one component of the
vector is $X_i \sim \text{Binomial}(n, p_i)$. Thus, $E(X_i) = np_i$ and $V(X_i) = np_i(1 - p_i)$.
Note also that $X_i + X_j \sim \text{Binomial}(n, p_i + p_j)$. Thus, $V(X_i + X_j) = n(p_i + p_j)(1 - [p_i + p_j])$. On the other hand, using the formula for the variance
of a sum, we have that $V(X_i + X_j) = V(X_i) + V(X_j) + 2\text{Cov}(X_i, X_j) = np_i(1 - p_i) + np_j(1 - p_j) + 2\text{Cov}(X_i, X_j)$. If we equate this formula with
$n(p_i + p_j)(1 - [p_i + p_j])$ and solve, we get $\text{Cov}(X_i, X_j) = -np_ip_j$.

Finally,  here  is  a  lemma  that  can  be  useful  for  finding  means  and  variances 
of  linear  combinations  of  multivariate  random  vectors. 

3.21 Lemma. If a is a vector and X is a random vector with mean $\mu$ and
variance $\Sigma$, then $E(a^T X) = a^T \mu$ and $V(a^T X) = a^T \Sigma a$. If A is a matrix then
$E(AX) = A\mu$ and $V(AX) = A \Sigma A^T$.

3.5  Conditional  Expectation 

Suppose that X and Y are random variables. What is the mean of X among
those times when Y = y? The answer is that we compute the mean of X as
before but we substitute $f_{X|Y}(x|y)$ for $f_X(x)$ in the definition of expectation.


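3.22 Definition. The conditional expectation of X given Y = y is

$$E(X \mid Y = y) = \begin{cases} \sum_x x\, f_{X|Y}(x|y) & \text{discrete case} \\ \int x\, f_{X|Y}(x|y)\,dx & \text{continuous case.} \end{cases} \qquad (3.13)$$

If r(x, y) is a function of x and y then

$$E(r(X,Y) \mid Y = y) = \begin{cases} \sum_x r(x,y)\, f_{X|Y}(x|y) & \text{discrete case} \\ \int r(x,y)\, f_{X|Y}(x|y)\,dx & \text{continuous case.} \end{cases} \qquad (3.14)$$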


Warning! Here is a subtle point. Whereas E(X) is a number, $E(X \mid Y = y)$
is a function of y. Before we observe Y, we don’t know the value of $E(X \mid Y = y)$
so it is a random variable which we denote $E(X \mid Y)$. In other words, $E(X \mid Y)$
is the random variable whose value is $E(X \mid Y = y)$ when Y = y. Similarly,
$E(r(X, Y) \mid Y)$ is the random variable whose value is $E(r(X, Y) \mid Y = y)$ when
Y = y. This is a very confusing point so let us look at an example.

3.23 Example. Suppose we draw $X \sim \text{Unif}(0, 1)$. After we observe X = x,
we draw $Y \mid X = x \sim \text{Unif}(x, 1)$. Intuitively, we expect that $E(Y \mid X = x) = (1 + x)/2$. In fact, $f_{Y|X}(y|x) = 1/(1 - x)$ for $x < y < 1$ and

$$E(Y \mid X = x) = \int_x^1 y\, f_{Y|X}(y|x)\,dy = \frac{1}{1 - x} \int_x^1 y\,dy = \frac{1 + x}{2}$$

as expected. Thus, $E(Y \mid X) = (1 + X)/2$. Notice that $E(Y \mid X) = (1 + X)/2$ is
a random variable whose value is the number $E(Y \mid X = x) = (1 + x)/2$ once
X = x is observed. ■

3.24 Theorem (The Rule of Iterated Expectations). For random variables X
and Y, assuming the expectations exist, we have that

$$E[E(Y \mid X)] = E(Y) \quad \text{and} \quad E[E(X \mid Y)] = E(X). \qquad (3.15)$$

More generally, for any function r(x, y) we have

$$E[E(r(X,Y) \mid X)] = E(r(X,Y)). \qquad (3.16)$$

Proof. We’ll prove the first equation. Using the definition of conditional
expectation and the fact that $f(x,y) = f(x) f(y|x)$,

$$E[E(Y \mid X)] = \int E(Y \mid X = x) f_X(x)\,dx = \iint y f(y|x)\,dy\, f(x)\,dx = \iint y f(y|x) f(x)\,dx\,dy = \iint y f(x,y)\,dx\,dy = E(Y).$$ ■


3.25 Example. Consider example 3.23. How can we compute E(Y)? One
method is to find the joint density f(x, y) and then compute $E(Y) = \iint y f(x,y)\,dx\,dy$.
An easier way is to do this in two steps. First, we already know that $E(Y \mid X) = (1 + X)/2$. Thus,

$$E(Y) = E\,E(Y \mid X) = E\left( \frac{1 + X}{2} \right) = \frac{1 + E(X)}{2} = \frac{1 + (1/2)}{2} = 3/4.$$ ■


3.26 Definition. The conditional variance is defined as

$$V(Y \mid X = x) = \int (y - \mu(x))^2 f(y|x)\,dy \qquad (3.17)$$

where $\mu(x) = E(Y \mid X = x)$.

3.27 Theorem. For random variables X and Y,

$$V(Y) = E\,V(Y \mid X) + V\,E(Y \mid X).$$
3.28 Example. Draw a county at random from the United States. Then draw
n people at random from the county. Let X be the number of those people
who have a certain disease. If Q denotes the proportion of people in that
county with the disease, then Q is also a random variable since it varies from
county to county. Given Q = q, we have that $X \sim \text{Binomial}(n, q)$. Thus,
$E(X \mid Q = q) = nq$ and $V(X \mid Q = q) = nq(1 - q)$. Suppose that the random
variable Q has a Uniform(0,1) distribution. A distribution that is constructed
in stages like this is called a hierarchical model and can be written as

$$Q \sim \text{Uniform}(0, 1)$$

$$X \mid Q = q \sim \text{Binomial}(n, q).$$

Now, $E(X) = E\,E(X \mid Q) = E(nQ) = nE(Q) = n/2$. Let us compute the
variance of X. Now, $V(X) = E\,V(X \mid Q) + V\,E(X \mid Q)$. Let’s compute these
two terms. First, $E\,V(X \mid Q) = E[nQ(1 - Q)] = nE(Q(1 - Q)) = n \int q(1 - q) f(q)\,dq = n \int_0^1 q(1 - q)\,dq = n/6$. Next, $V\,E(X \mid Q) = V(nQ) = n^2 V(Q) = n^2 \int (q - (1/2))^2\,dq = n^2/12$. Hence, $V(X) = (n/6) + (n^2/12)$. ■
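The two-stage construction translates directly into a simulation, which gives a rough check on both formulas. In the sketch below, n = 10 is an illustrative choice (so the targets are E(X) = 5 and V(X) = 10/6 + 100/12 = 10); the function name, repetition count, and seed are assumptions.

```python
import random

def simulate_hierarchical(n=10, reps=100_000, seed=8):
    """Draw Q ~ Uniform(0, 1), then X | Q = q ~ Binomial(n, q); return the X values."""
    rng = random.Random(seed)
    xs = []
    for _ in range(reps):
        q = rng.random()
        # A Binomial(n, q) draw as the number of successes in n Bernoulli(q) trials.
        xs.append(sum(rng.random() < q for _ in range(n)))
    return xs

n = 10
xs = simulate_hierarchical(n)
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
# Formulas from the example: E(X) = n/2 and V(X) = n/6 + n^2/12.
```

The variance is much larger than the Binomial value np(1 - p) would be for any fixed p, because the second term $V\,E(X \mid Q)$ grows like $n^2$.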

3.6  Moment  Generating  Functions 

Now  we  will  define  the  moment  generating  function  which  is  used  for  finding 
moments,  for  finding  the  distribution  of  sums  of  random  variables  and  which 
is  also  used  in  the  proofs  of  some  theorems. 


3.29 Definition. The moment generating function (MGF), or Laplace transform, of $X$ is defined by

$$\psi_X(t) = \mathbb{E}(e^{tX}) = \int e^{tx}\,dF(x)$$


where  t  varies  over  the  real  numbers. 


In  what  follows,  we  assume  that  the  MGF  is  well  defined  for  all  t  in  some 
open  interval  around  t  =  0.  2 

When  the  MGF  is  well  defined,  it  can  be  shown  that  we  can  interchange  the 
operations  of  differentiation  and  “taking  expectation.”  This  leads  to 


$$\psi'(0) = \left[\frac{d}{dt}\mathbb{E}e^{tX}\right]_{t=0} = \mathbb{E}\left[\frac{d}{dt}e^{tX}\right]_{t=0} = \mathbb{E}\left[Xe^{tX}\right]_{t=0} = \mathbb{E}(X).$$

By taking $k$ derivatives we conclude that $\psi^{(k)}(0) = \mathbb{E}(X^k)$. This gives us a method for computing the moments of a distribution.

2 A related function is the characteristic function, defined by $\mathbb{E}(e^{itX})$ where $i = \sqrt{-1}$. This function is always well defined for all $t$.


3.30 Example. Let $X \sim \text{Exp}(1)$. For any $t < 1$,

$$\psi_X(t) = \mathbb{E}e^{tX} = \int_0^\infty e^{tx}e^{-x}\,dx = \int_0^\infty e^{(t-1)x}\,dx = \frac{1}{1 - t}.$$

The integral is divergent if $t \ge 1$. So, $\psi_X(t) = 1/(1 - t)$ for all $t < 1$. Now, $\psi'(0) = 1$ and $\psi''(0) = 2$. Hence, $\mathbb{E}(X) = 1$ and $\mathbb{V}(X) = \mathbb{E}(X^2) - \mu^2 = 2 - 1 = 1$. ■
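The derivative rule for moments can be checked numerically on the Exp(1) example. This is a minimal sketch using central finite differences; the helper names and the step size `h` are assumptions introduced here for illustration, and the approximation is only as good as the finite-difference error.

```python
def mgf_exp1(t):
    """MGF of Exp(1): psi(t) = 1 / (1 - t), valid only for t < 1."""
    if t >= 1:
        raise ValueError("the MGF of Exp(1) is undefined for t >= 1")
    return 1.0 / (1.0 - t)


def first_derivative_at_zero(psi, h=1e-4):
    # Central difference approximation to psi'(0).
    return (psi(h) - psi(-h)) / (2 * h)


def second_derivative_at_zero(psi, h=1e-4):
    # Central difference approximation to psi''(0).
    return (psi(h) - 2 * psi(0.0) + psi(-h)) / (h * h)


mean = first_derivative_at_zero(mgf_exp1)             # approximates E(X)
second_moment = second_derivative_at_zero(mgf_exp1)   # approximates E(X^2)
variance = second_moment - mean ** 2                  # V(X) = E(X^2) - (E X)^2

print(mean, second_moment, variance)
```

The three printed numbers should be close to 1, 2, and 1, matching the values derived from $\psi'(0)$ and $\psi''(0)$.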


3.31 Lemma. Properties of the MGF.

(1) If $Y = aX + b$, then $\psi_Y(t) = e^{bt}\psi_X(at)$.

(2) If $X_1, \ldots, X_n$ are independent and $Y = \sum_i X_i$, then $\psi_Y(t) = \prod_i \psi_i(t)$ where $\psi_i$ is the MGF of $X_i$.

3.32 Example. Let $X \sim \text{Binomial}(n, p)$. We know that $X = \sum_{i=1}^n X_i$ where $\mathbb{P}(X_i = 1) = p$ and $\mathbb{P}(X_i = 0) = 1 - p$. Now $\psi_i(t) = \mathbb{E}e^{X_i t} = (p \times e^t) + ((1 - p) \times 1) = pe^t + q$ where $q = 1 - p$. Thus, $\psi_X(t) = \prod_i \psi_i(t) = (pe^t + q)^n$. ■


Recall that $X$ and $Y$ are equal in distribution if they have the same distribution function and we write $X \overset{d}{=} Y$.


3.33 Theorem. Let $X$ and $Y$ be random variables. If $\psi_X(t) = \psi_Y(t)$ for all $t$ in an open interval around 0, then $X \overset{d}{=} Y$.

3.34 Example. Let $X_1 \sim \text{Binomial}(n_1, p)$ and $X_2 \sim \text{Binomial}(n_2, p)$ be independent. Let $Y = X_1 + X_2$. Then,

$$\psi_Y(t) = \psi_1(t)\psi_2(t) = (pe^t + q)^{n_1}(pe^t + q)^{n_2} = (pe^t + q)^{n_1 + n_2}$$

and we recognize the latter as the MGF of a Binomial$(n_1 + n_2, p)$ distribution. Since the MGF characterizes the distribution (i.e., there can't be another random variable which has the same MGF) we conclude that $Y \sim \text{Binomial}(n_1 + n_2, p)$. ■

Moment Generating Functions for Some Common Distributions

Distribution          MGF $\psi(t)$
Bernoulli(p)          $pe^t + (1 - p)$
Binomial(n, p)        $(pe^t + (1 - p))^n$
Poisson(λ)            $e^{\lambda(e^t - 1)}$
Normal(μ, σ)          $\exp\{\mu t + \sigma^2 t^2/2\}$
Gamma(α, β)           $\left(\frac{1}{1 - \beta t}\right)^{\alpha}$ for $t < 1/\beta$

3.35 Example. Let $Y_1 \sim \text{Poisson}(\lambda_1)$ and $Y_2 \sim \text{Poisson}(\lambda_2)$ be independent. The moment generating function of $Y = Y_1 + Y_2$ is $\psi_Y(t) = \psi_{Y_1}(t)\psi_{Y_2}(t) = e^{\lambda_1(e^t - 1)}e^{\lambda_2(e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)}$ which is the moment generating function of a Poisson$(\lambda_1 + \lambda_2)$. We have thus proved that the sum of two independent Poisson random variables has a Poisson distribution. ■


3.7  Appendix 

Expectation as an Integral. The integral of a measurable function $r(x)$ is defined as follows. First suppose that $r$ is simple, meaning that it takes finitely many values $a_1, \ldots, a_k$ over a partition $A_1, \ldots, A_k$. Then define

$$\int r(x)\,dF(x) = \sum_{i=1}^k a_i\,\mathbb{P}(r(X) \in A_i).$$

The integral of a positive measurable function $r$ is defined by $\int r(x)\,dF(x) = \lim_i \int r_i(x)\,dF(x)$ where $r_i$ is a sequence of simple functions such that $r_i(x) \le r(x)$ and $r_i(x) \to r(x)$ as $i \to \infty$. This does not depend on the particular sequence. The integral of a measurable function $r$ is defined to be $\int r(x)\,dF(x) = \int r^+(x)\,dF(x) - \int r^-(x)\,dF(x)$ assuming both integrals are finite, where $r^+(x) = \max\{r(x), 0\}$ and $r^-(x) = -\min\{r(x), 0\}$.


3.8  Exercises 

1. Suppose we play a game where we start with $c$ dollars. On each play of the game you either double or halve your money, with equal probability. What is your expected fortune after $n$ trials?

2. Show that $\mathbb{V}(X) = 0$ if and only if there is a constant $c$ such that $\mathbb{P}(X = c) = 1$.

3. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, 1)$ and let $Y_n = \max\{X_1, \ldots, X_n\}$. Find $\mathbb{E}(Y_n)$.

4. A particle starts at the origin of the real line and moves along the line in jumps of one unit. For each jump the probability is $p$ that the particle will jump one unit to the left and the probability is $1 - p$ that the particle will jump one unit to the right. Let $X_n$ be the position of the particle after $n$ units. Find $\mathbb{E}(X_n)$ and $\mathbb{V}(X_n)$. (This is known as a random walk.)

5.  A  fair  coin  is  tossed  until  a  head  is  obtained.  What  is  the  expected 
number  of  tosses  that  will  be  required? 

6.  Prove  Theorem  3.6  for  discrete  random  variables. 

7. Let $X$ be a continuous random variable with CDF $F$. Suppose that $\mathbb{P}(X > 0) = 1$ and that $\mathbb{E}(X)$ exists. Show that $\mathbb{E}(X) = \int_0^\infty \mathbb{P}(X > x)\,dx$.

Hint: Consider integrating by parts. The following fact is helpful: if $\mathbb{E}(X)$ exists then $\lim_{x \to \infty} x[1 - F(x)] = 0$.

8. Prove Theorem 3.17.

9. (Computer Experiment.) Let $X_1, X_2, \ldots, X_n$ be $N(0, 1)$ random variables and let $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Plot $\overline{X}_n$ versus $n$ for $n = 1, \ldots, 10{,}000$. Repeat for $X_1, X_2, \ldots, X_n \sim \text{Cauchy}$. Explain why there is such a difference.

10. Let $X \sim N(0, 1)$ and let $Y = e^X$. Find $\mathbb{E}(Y)$ and $\mathbb{V}(Y)$.

11. (Computer Experiment: Simulating the Stock Market.) Let $Y_1, Y_2, \ldots$ be independent random variables such that $\mathbb{P}(Y_i = 1) = \mathbb{P}(Y_i = -1) = 1/2$. Let $X_n = \sum_{i=1}^n Y_i$. Think of $Y_i = 1$ as "the stock price increased by one dollar", $Y_i = -1$ as "the stock price decreased by one dollar", and $X_n$ as the value of the stock on day $n$.

(a) Find $\mathbb{E}(X_n)$ and $\mathbb{V}(X_n)$.

(b) Simulate $X_n$ and plot $X_n$ versus $n$ for $n = 1, 2, \ldots, 10{,}000$. Repeat the whole simulation several times. Notice two things. First, it's easy to "see" patterns in the sequence even though it is random. Second, you will find that the four runs look very different even though they were generated the same way. How do the calculations in (a) explain the second observation?

12. Prove the formulas given in the table at the beginning of Section 3.4 for the Bernoulli, Poisson, Uniform, Exponential, Gamma, and Beta. Here are some hints. For the mean of the Poisson, use the fact that $e^a = \sum_{x=0}^\infty a^x/x!$. To compute the variance, first compute $\mathbb{E}(X(X - 1))$. For the mean of the Gamma, it will help to multiply and divide by $\Gamma(\alpha + 1)/\beta^{\alpha+1}$ and use the fact that a Gamma density integrates to 1. For the Beta, multiply and divide by $\Gamma(\alpha + 1)\Gamma(\beta)/\Gamma(\alpha + \beta + 1)$.

13.  Suppose  we  generate  a  random  variable  X  in  the  following  way.  First 
we  flip  a  fair  coin.  If  the  coin  is  heads,  take  X  to  have  a  Unif(0,l) 
distribution.  If  the  coin  is  tails,  take  X  to  have  a  Unif(3,4)  distribution. 

(a)  Find  the  mean  of  X. 

(b)  Find  the  standard  deviation  of  X. 

14. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be random variables and let $a_1, \ldots, a_m$ and $b_1, \ldots, b_n$ be constants. Show that

$$\text{Cov}\left(\sum_{i=1}^m a_i X_i,\ \sum_{j=1}^n b_j Y_j\right) = \sum_{i=1}^m \sum_{j=1}^n a_i b_j\,\text{Cov}(X_i, Y_j).$$

15. Let

$$f_{X,Y}(x, y) = \begin{cases} \frac{1}{3}(x + y) & 0 < x < 1,\ 0 < y < 2 \\ 0 & \text{otherwise.} \end{cases}$$

Find $\mathbb{V}(2X - 3Y + 8)$.


16. Let $r(x)$ be a function of $x$ and let $s(y)$ be a function of $y$. Show that

$$\mathbb{E}(r(X)s(Y)|X) = r(X)\,\mathbb{E}(s(Y)|X).$$

Also, show that $\mathbb{E}(r(X)|X) = r(X)$.

17. Prove that

$$\mathbb{V}(Y) = \mathbb{E}\,\mathbb{V}(Y|X) + \mathbb{V}\,\mathbb{E}(Y|X).$$

Hint: Let $m = \mathbb{E}(Y)$ and let $b(x) = \mathbb{E}(Y|X = x)$. Note that $\mathbb{E}(b(X)) = \mathbb{E}\,\mathbb{E}(Y|X) = \mathbb{E}(Y) = m$. Bear in mind that $b$ is a function of $x$. Now write $\mathbb{V}(Y) = \mathbb{E}(Y - m)^2 = \mathbb{E}((Y - b(X)) + (b(X) - m))^2$. Expand the square and take the expectation. You then have to take the expectation of three terms. In each case, use the rule of the iterated expectation: $\mathbb{E}(\text{stuff}) = \mathbb{E}(\mathbb{E}(\text{stuff}|X))$.

18.  Show  that  if  E(X|Y  =  y)  =  c  for  some  constant  c,  then  X  and  Y  are 
uncorrelated. 


19. This question is to help you understand the idea of a sampling distribution. Let $X_1, \ldots, X_n$ be IID with mean $\mu$ and variance $\sigma^2$. Let $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then $\overline{X}_n$ is a statistic, that is, a function of the data. Since $\overline{X}_n$ is a random variable, it has a distribution. This distribution is called the sampling distribution of the statistic. Recall from Theorem 3.17 that $\mathbb{E}(\overline{X}_n) = \mu$ and $\mathbb{V}(\overline{X}_n) = \sigma^2/n$. Don't confuse the distribution of the data $f_X$ and the distribution of the statistic $f_{\overline{X}_n}$. To make this clear, let $X_1, \ldots, X_n \sim \text{Uniform}(0, 1)$. Let $f_X$ be the density of the Uniform(0, 1). Plot $f_X$. Now let $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Find $\mathbb{E}(\overline{X}_n)$ and $\mathbb{V}(\overline{X}_n)$. Plot them as a function of $n$. Interpret. Now simulate the distribution of $\overline{X}_n$ for $n = 1, 5, 25, 100$. Check that the simulated values of $\mathbb{E}(\overline{X}_n)$ and $\mathbb{V}(\overline{X}_n)$ agree with your theoretical calculations. What do you notice about the sampling distribution of $\overline{X}_n$ as $n$ increases?

20.  Prove  Lemma  3.21. 


21.  Let  X  and  Y  be  random  variables.  Suppose  that  E(Y|X)  =  X.  Show 
that  Cov(X,  Y)  =  V(X). 

22. Let $X \sim \text{Uniform}(0, 1)$. Let $0 < a < b < 1$. Let

$$Y = \begin{cases} 1 & 0 < x < b \\ 0 & \text{otherwise} \end{cases}$$

and let

$$Z = \begin{cases} 1 & a < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Are $Y$ and $Z$ independent? Why/Why not?

(b) Find $\mathbb{E}(Y|Z)$. Hint: What values $z$ can $Z$ take? Now find $\mathbb{E}(Y|Z = z)$.

23. Find the moment generating function for the Poisson, Normal, and Gamma distributions.

24. Let $X_1, \ldots, X_n \sim \text{Exp}(\beta)$. Find the moment generating function of $X_i$. Prove that $\sum_{i=1}^n X_i \sim \text{Gamma}(n, \beta)$.


4 

Inequalities 


4.1  Probability  Inequalities 

Inequalities  are  useful  for  bounding  quantities  that  might  otherwise  be  hard 
to  compute.  They  will  also  be  used  in  the  theory  of  convergence  which  is 
discussed  in  the  next  chapter.  Our  first  inequality  is  Markov’s  inequality. 

4.1 Theorem (Markov's inequality). Let $X$ be a non-negative random variable and suppose that $\mathbb{E}(X)$ exists. For any $t > 0$,

$$\mathbb{P}(X > t) \le \frac{\mathbb{E}(X)}{t}. \qquad (4.1)$$

Proof. Since $X > 0$,

$$\mathbb{E}(X) = \int_0^\infty x f(x)\,dx = \int_0^t x f(x)\,dx + \int_t^\infty x f(x)\,dx \ge \int_t^\infty x f(x)\,dx \ge t\int_t^\infty f(x)\,dx = t\,\mathbb{P}(X > t).$$

4.2 Theorem (Chebyshev's inequality). Let $\mu = \mathbb{E}(X)$ and $\sigma^2 = \mathbb{V}(X)$. Then,

$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2} \quad \text{and} \quad \mathbb{P}(|Z| \ge k) \le \frac{1}{k^2} \qquad (4.2)$$

where $Z = (X - \mu)/\sigma$. In particular, $\mathbb{P}(|Z| > 2) \le 1/4$ and $\mathbb{P}(|Z| > 3) \le 1/9$.

Proof. We use Markov's inequality to conclude that

$$\mathbb{P}(|X - \mu| \ge t) = \mathbb{P}(|X - \mu|^2 \ge t^2) \le \frac{\mathbb{E}(X - \mu)^2}{t^2} = \frac{\sigma^2}{t^2}.$$

The second part follows by setting $t = k\sigma$. ■

4.3 Example. Suppose we test a prediction method, a neural net for example, on a set of $n$ new test cases. Let $X_i = 1$ if the predictor is wrong and $X_i = 0$ if the predictor is right. Then $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$ is the observed error rate. Each $X_i$ may be regarded as a Bernoulli with unknown mean $p$. We would like to know the true, but unknown, error rate $p$. Intuitively, we expect that $\overline{X}_n$ should be close to $p$. How likely is $\overline{X}_n$ to not be within $\epsilon$ of $p$? We have that $\mathbb{V}(\overline{X}_n) = \mathbb{V}(X_1)/n = p(1 - p)/n$ and

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le \frac{\mathbb{V}(\overline{X}_n)}{\epsilon^2} = \frac{p(1 - p)}{n\epsilon^2} \le \frac{1}{4n\epsilon^2}$$

since $p(1 - p) \le \frac{1}{4}$ for all $p$. For $\epsilon = .2$ and $n = 100$ the bound is .0625. ■

Hoeffding’s  inequality  is  similar  in  spirit  to  Markov’s  inequality  but  it  is  a 
sharper  inequality.  We  present  the  result  here  in  two  parts. 


4.4 Theorem (Hoeffding's Inequality). Let $Y_1, \ldots, Y_n$ be independent observations such that $\mathbb{E}(Y_i) = 0$ and $a_i \le Y_i \le b_i$. Let $\epsilon > 0$. Then, for any $t > 0$,

$$\mathbb{P}\left(\sum_{i=1}^n Y_i \ge \epsilon\right) \le e^{-t\epsilon}\prod_{i=1}^n e^{t^2(b_i - a_i)^2/8}. \qquad (4.3)$$

4.5 Theorem. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then, for any $\epsilon > 0$,

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le 2e^{-2n\epsilon^2} \qquad (4.4)$$

where $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$.


4.6 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Let $n = 100$ and $\epsilon = .2$. We saw that Chebyshev's inequality yielded

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le .0625.$$

According to Hoeffding's inequality,

$$\mathbb{P}(|\overline{X}_n - p| > .2) \le 2e^{-2(100)(.2)^2} = .00067$$

which is much smaller than .0625. ■
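The two bounds compared in Example 4.6 are easy to tabulate. The following sketch evaluates the worst-case Chebyshev bound $1/(4n\epsilon^2)$ and the Hoeffding bound $2e^{-2n\epsilon^2}$; the function names and the extra sample sizes 25 and 400 are illustrative choices made here, not values from the text.

```python
import math


def chebyshev_bound(n, eps):
    """Worst-case Chebyshev bound 1 / (4 n eps^2) on P(|Xbar_n - p| > eps)."""
    return 1.0 / (4.0 * n * eps ** 2)


def hoeffding_bound(n, eps):
    """Hoeffding bound 2 exp(-2 n eps^2) on P(|Xbar_n - p| > eps)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)


for n in (25, 100, 400):
    print(n, chebyshev_bound(n, 0.2), hoeffding_bound(n, 0.2))
```

At $n = 100$ and $\epsilon = .2$ the two functions give the values .0625 and roughly .00067 quoted in the example. The Hoeffding bound decays exponentially in $n$ while the Chebyshev bound decays only like $1/n$.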

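Hoeffding's inequality gives us a simple way to create a confidence interval for a binomial parameter $p$. We will discuss confidence intervals in detail later (see Chapter 6) but here is the basic idea. Fix $\alpha > 0$ and let

$$\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}.$$

By Hoeffding's inequality,

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon_n) \le 2e^{-2n\epsilon_n^2} = \alpha.$$

Let $C = (\overline{X}_n - \epsilon_n,\ \overline{X}_n + \epsilon_n)$. Then, $\mathbb{P}(p \notin C) = \mathbb{P}(|\overline{X}_n - p| > \epsilon_n) \le \alpha$. Hence, $\mathbb{P}(p \in C) \ge 1 - \alpha$, that is, the random interval $C$ traps the true parameter value $p$ with probability at least $1 - \alpha$; we call $C$ a $1 - \alpha$ confidence interval. More on this later.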
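The interval construction above can be tried out in a small simulation. This is a sketch under stated assumptions: the half-width is taken to be the $\epsilon_n$ that solves $2e^{-2n\epsilon_n^2} = \alpha$, the function names are made up for illustration, and the values $p = 0.4$, $\alpha = 0.05$, $n = 100$, the replication count, and the seed are arbitrary choices.

```python
import math
import random


def hoeffding_halfwidth(n, alpha):
    """Half-width eps_n solving 2 exp(-2 n eps^2) = alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def coverage(p, n, alpha, reps=2000, seed=1):
    """Fraction of simulated intervals (Xbar - eps, Xbar + eps) that contain p."""
    rng = random.Random(seed)
    eps = hoeffding_halfwidth(n, alpha)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n  # sample proportion
        if xbar - eps < p < xbar + eps:
            hits += 1
    return hits / reps


eps_100 = hoeffding_halfwidth(100, 0.05)
cov_100 = coverage(p=0.4, n=100, alpha=0.05)
print(eps_100, cov_100)
```

Because Hoeffding's inequality is conservative, the simulated coverage will typically be noticeably higher than the nominal $1 - \alpha$.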

The  following  inequality  is  useful  for  bounding  probability  statements  about 
Normal  random  variables. 


4.7 Theorem (Mill's Inequality). Let $Z \sim N(0, 1)$. Then,

$$\mathbb{P}(|Z| > t) \le \sqrt{\frac{2}{\pi}}\,\frac{e^{-t^2/2}}{t}.$$
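Mill's inequality can be compared with the exact Normal tail. The sketch below uses the identity $\mathbb{P}(|Z| > t) = \text{erfc}(t/\sqrt{2})$ for a standard Normal $Z$; the function names and the grid of $t$ values are illustrative assumptions.

```python
import math


def normal_two_sided_tail(t):
    """Exact P(|Z| > t) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


def mills_bound(t):
    """Mill's bound sqrt(2/pi) * exp(-t^2 / 2) / t, defined for t > 0."""
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0) / t


for t in (0.5, 1.0, 2.0, 3.0):
    print(t, normal_two_sided_tail(t), mills_bound(t))
```

The bound is loose for small $t$ (it even exceeds 1 near zero) and becomes tight as $t$ grows.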
4.2 Inequalities For Expectations

This section contains two inequalities on expected values.

4.8 Theorem (Cauchy-Schwarz inequality). If $X$ and $Y$ have finite variances then

$$\mathbb{E}|XY| \le \sqrt{\mathbb{E}(X^2)\,\mathbb{E}(Y^2)}. \qquad (4.5)$$

Recall that a function $g$ is convex if for each $x$, $y$ and each $\alpha \in [0, 1]$,

$$g(\alpha x + (1 - \alpha)y) \le \alpha g(x) + (1 - \alpha)g(y).$$

If $g$ is twice differentiable and $g''(x) \ge 0$ for all $x$, then $g$ is convex. It can be shown that if $g$ is convex, then $g$ lies above any line that touches $g$ at some point, called a tangent line. A function $g$ is concave if $-g$ is convex. Examples of convex functions are $g(x) = x^2$ and $g(x) = e^x$. Examples of concave functions are $g(x) = -x^2$ and $g(x) = \log x$.


4.9 Theorem (Jensen's inequality). If $g$ is convex, then

$$\mathbb{E}g(X) \ge g(\mathbb{E}X). \qquad (4.6)$$

If $g$ is concave, then

$$\mathbb{E}g(X) \le g(\mathbb{E}X). \qquad (4.7)$$

Proof. Let $L(x) = a + bx$ be a line, tangent to $g(x)$ at the point $\mathbb{E}(X)$. Since $g$ is convex, it lies above the line $L(x)$. So,

$$\mathbb{E}g(X) \ge \mathbb{E}L(X) = \mathbb{E}(a + bX) = a + b\,\mathbb{E}(X) = L(\mathbb{E}(X)) = g(\mathbb{E}X).$$ ■

From Jensen's inequality we see that $\mathbb{E}(X^2) \ge (\mathbb{E}X)^2$ and if $X$ is positive, then $\mathbb{E}(1/X) \ge 1/\mathbb{E}(X)$. Since log is concave, $\mathbb{E}(\log X) \le \log \mathbb{E}(X)$.
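The three consequences of Jensen's inequality can be verified on a small example. The distribution below (uniform on $\{1, 2, 3, 4\}$) is a hypothetical choice made only for illustration; any positive random variable would do.

```python
import math

# Hypothetical example distribution: X uniform on {1, 2, 3, 4}.
values = [1.0, 2.0, 3.0, 4.0]
probs = [0.25, 0.25, 0.25, 0.25]


def expect(g):
    """E g(X) for the discrete distribution defined above."""
    return sum(p * g(x) for x, p in zip(values, probs))


mean = expect(lambda x: x)

checks = {
    "E(X^2) >= (EX)^2": expect(lambda x: x * x) >= mean ** 2,        # x^2 is convex
    "E(1/X) >= 1/E(X)": expect(lambda x: 1.0 / x) >= 1.0 / mean,     # 1/x is convex on x > 0
    "E(log X) <= log E(X)": expect(math.log) <= math.log(mean),      # log is concave
}
for name, holds in checks.items():
    print(name, holds)
```

Each check should report that the inequality holds, with the direction determined by whether the function is convex or concave.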

4.3  Bibliographic  Remarks 

Devroye  et  al.  (1996)  is  a  good  reference  on  probability  inequalities  and  their 
use  in  statistics  and  pattern  recognition.  The  following  proof  of  Hoeffding’s 
inequality  is  from  that  text. 


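4.4 Appendix

Proof of Hoeffding's Inequality. We will make use of the exact form of Taylor's theorem: if $g$ is a smooth function, then there is a number $\xi \in (0, u)$ such that $g(u) = g(0) + ug'(0) + \frac{u^2}{2}g''(\xi)$.

Proof of Theorem 4.4. For any $t > 0$, we have, from Markov's inequality, that

$$\mathbb{P}\left(\sum_{i=1}^n Y_i \ge \epsilon\right) = \mathbb{P}\left(t\sum_{i=1}^n Y_i \ge t\epsilon\right) = \mathbb{P}\left(e^{t\sum_{i=1}^n Y_i} \ge e^{t\epsilon}\right) \le e^{-t\epsilon}\,\mathbb{E}\left(e^{t\sum_{i=1}^n Y_i}\right) = e^{-t\epsilon}\prod_i \mathbb{E}(e^{tY_i}). \qquad (4.8)$$

Since $a_i \le Y_i \le b_i$, we can write $Y_i$ as a convex combination of $a_i$ and $b_i$, namely, $Y_i = \alpha b_i + (1 - \alpha)a_i$ where $\alpha = (Y_i - a_i)/(b_i - a_i)$. So, by the convexity of $e^{ty}$ we have

$$e^{tY_i} \le \frac{Y_i - a_i}{b_i - a_i}e^{tb_i} + \frac{b_i - Y_i}{b_i - a_i}e^{ta_i}.$$

Take expectations of both sides and use the fact that $\mathbb{E}(Y_i) = 0$ to get

$$\mathbb{E}e^{tY_i} \le -\frac{a_i}{b_i - a_i}e^{tb_i} + \frac{b_i}{b_i - a_i}e^{ta_i} = e^{g(u)}$$

where $u = t(b_i - a_i)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^u)$ and $\gamma = -a_i/(b_i - a_i)$.

Note that $g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor's theorem, there is a $\xi \in (0, u)$ such that

$$g(u) = g(0) + ug'(0) + \frac{u^2}{2}g''(\xi) = \frac{u^2}{2}g''(\xi) \le \frac{u^2}{8} = \frac{t^2(b_i - a_i)^2}{8}.$$

Hence,

$$\mathbb{E}e^{tY_i} \le e^{g(u)} \le e^{t^2(b_i - a_i)^2/8}.$$

The result follows from (4.8). ■

Proof of Theorem 4.5. Let $Y_i = (1/n)(X_i - p)$. Then $\mathbb{E}(Y_i) = 0$ and $a \le Y_i \le b$ where $a = -p/n$ and $b = (1 - p)/n$. Also, $(b - a)^2 = 1/n^2$. Applying Theorem 4.4 we get

$$\mathbb{P}(\overline{X}_n - p > \epsilon) = \mathbb{P}\left(\sum_i Y_i > \epsilon\right) \le e^{-t\epsilon}e^{t^2/(8n)}.$$

The above holds for any $t > 0$. In particular, take $t = 4n\epsilon$ and we get $\mathbb{P}(\overline{X}_n - p > \epsilon) \le e^{-2n\epsilon^2}$. By a similar argument we can show that $\mathbb{P}(\overline{X}_n - p < -\epsilon) \le e^{-2n\epsilon^2}$. Putting these together we get $\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le 2e^{-2n\epsilon^2}$. ■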
4.5 Exercises

1. Let $X \sim \text{Exponential}(\beta)$. Find $\mathbb{P}(|X - \mu_X| \ge k\sigma_X)$ for $k > 1$. Compare this to the bound you get from Chebyshev's inequality.

2. Let $X \sim \text{Poisson}(\lambda)$. Use Chebyshev's inequality to show that $\mathbb{P}(X \ge 2\lambda) \le 1/\lambda$.

3. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Bound $\mathbb{P}(|\overline{X}_n - p| > \epsilon)$ using Chebyshev's inequality and using Hoeffding's inequality. Show that, when $n$ is large, the bound from Hoeffding's inequality is smaller than the bound from Chebyshev's inequality.

4. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$.

(a) Let $\alpha > 0$ be fixed and define

$$\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}.$$

Let $\hat{p}_n = n^{-1}\sum_{i=1}^n X_i$. Define $C_n = (\hat{p}_n - \epsilon_n,\ \hat{p}_n + \epsilon_n)$. Use Hoeffding's inequality to show that

$$\mathbb{P}(C_n \text{ contains } p) \ge 1 - \alpha.$$

In practice, we truncate the interval so it does not go below 0 or above 1.

(b) (Computer Experiment.) Let's examine the properties of this confidence interval. Let $\alpha = 0.05$ and $p = 0.4$. Conduct a simulation study to see how often the interval contains $p$ (called the coverage). Do this for various values of $n$ between 1 and 10000. Plot the coverage versus $n$.

(c) Plot the length of the interval versus $n$. Suppose we want the length of the interval to be no more than .05. How large should $n$ be?

5. Prove Mill's inequality, Theorem 4.7. Hint. Note that $\mathbb{P}(|Z| > t) = 2\mathbb{P}(Z > t)$. Now write out what $\mathbb{P}(Z > t)$ means and note that $x/t > 1$ whenever $x > t$.

6. Let $Z \sim N(0, 1)$. Find $\mathbb{P}(|Z| > t)$ and plot this as a function of $t$. From Markov's inequality, we have the bound $\mathbb{P}(|Z| > t) \le \mathbb{E}|Z|^k/t^k$ for any $k > 0$. Plot these bounds for $k = 1, 2, 3, 4, 5$ and compare them to the true value of $\mathbb{P}(|Z| > t)$. Also, plot the bound from Mill's inequality.

7. Let $X_1, \ldots, X_n \sim N(0, 1)$. Bound $\mathbb{P}(|\overline{X}_n| > t)$ using Mill's inequality, where $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Compare to the Chebyshev bound.


5 

Convergence  of  Random  Variables 


5.1  Introduction 

The  most  important  aspect  of  probability  theory  concerns  the  behavior  of 
sequences  of  random  variables.  This  part  of  probability  is  called  large  sample 
theory,  or  limit  theory,  or  asymptotic  theory.  The  basic  question  is  this: 
what  can  we  say  about  the  limiting  behavior  of  a  sequence  of  random  variables 
Xi,  X2,  X3, . . .?  Since  statistics  and  data  mining  are  all  about  gathering  data, 
we  will  naturally  be  interested  in  what  happens  as  we  gather  more  and  more 
data. 

In calculus we say that a sequence of real numbers $x_n$ converges to a limit $x$ if, for every $\epsilon > 0$, $|x_n - x| < \epsilon$ for all large $n$. In probability, convergence is more subtle. Going back to calculus for a moment, suppose that $x_n = x$ for all $n$. Then, trivially, $\lim_{n \to \infty} x_n = x$. Consider a probabilistic version of this example. Suppose that $X_1, X_2, \ldots$ is a sequence of random variables which are independent and suppose each has a $N(0, 1)$ distribution. Since these all have the same distribution, we are tempted to say that $X_n$ "converges" to $X \sim N(0, 1)$. But this can't quite be right since $\mathbb{P}(X_n = X) = 0$ for all $n$. (Two continuous random variables are equal with probability zero.)

Here is another example. Consider $X_1, X_2, \ldots$ where $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is very concentrated around 0 for large $n$ so we would like to say that $X_n$ converges to 0. But $\mathbb{P}(X_n = 0) = 0$ for all $n$. Clearly, we need to develop some tools for discussing convergence in a rigorous way. This chapter develops the appropriate methods.

There are two main ideas in this chapter which we state informally here:

1. The law of large numbers says that the sample average $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$ converges in probability to the expectation $\mu = \mathbb{E}(X_i)$. This means that $\overline{X}_n$ is close to $\mu$ with high probability.

2. The central limit theorem says that $\sqrt{n}(\overline{X}_n - \mu)$ converges in distribution to a Normal distribution. This means that the sample average has approximately a Normal distribution for large $n$.


5.2  Types  of  Convergence 

The  two  main  types  of  convergence  are  defined  as  follows. 


5.1 Definition. Let $X_1, X_2, \ldots$ be a sequence of random variables and let $X$ be another random variable. Let $F_n$ denote the CDF of $X_n$ and let $F$ denote the CDF of $X$.

1. $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$,

$$\mathbb{P}(|X_n - X| > \epsilon) \to 0 \qquad (5.1)$$

as $n \to \infty$.

2. $X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if

$$\lim_{n \to \infty} F_n(t) = F(t) \qquad (5.2)$$

at all $t$ for which $F$ is continuous.


When the limiting random variable is a point mass, we change the notation slightly. If $\mathbb{P}(X = c) = 1$ and $X_n \xrightarrow{P} X$ then we write $X_n \xrightarrow{P} c$. Similarly, if $X_n \rightsquigarrow X$ we write $X_n \rightsquigarrow c$.

There  is  another  type  of  convergence  which  we  introduce  mainly  because  it 
is  useful  for  proving  convergence  in  probability. 
FIGURE 5.1. Example 5.3. $X_n$ converges in distribution to $X$ because $F_n(t)$ converges to $F(t)$ at all points except $t = 0$. Convergence is not required at $t = 0$ because $t = 0$ is not a point of continuity for $F$.


5.2 Definition. $X_n$ converges to $X$ in quadratic mean (also called convergence in $L_2$), written $X_n \xrightarrow{qm} X$, if

$$\mathbb{E}(X_n - X)^2 \to 0 \qquad (5.3)$$

as $n \to \infty$.


Again, if $X$ is a point mass at $c$ we write $X_n \xrightarrow{qm} c$ instead of $X_n \xrightarrow{qm} X$.

5.3 Example. Let $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is concentrating at 0 so we would like to say that $X_n$ converges to 0. Let's see if this is true. Let $F$ be the distribution function for a point mass at 0. Note that $\sqrt{n}X_n \sim N(0, 1)$. Let $Z$ denote a standard normal random variable. For $t < 0$, $F_n(t) = \mathbb{P}(X_n < t) = \mathbb{P}(\sqrt{n}X_n < \sqrt{n}t) = \mathbb{P}(Z < \sqrt{n}t) \to 0$ since $\sqrt{n}t \to -\infty$. For $t > 0$, $F_n(t) = \mathbb{P}(X_n < t) = \mathbb{P}(\sqrt{n}X_n < \sqrt{n}t) = \mathbb{P}(Z < \sqrt{n}t) \to 1$ since $\sqrt{n}t \to \infty$. Hence, $F_n(t) \to F(t)$ for all $t \ne 0$ and so $X_n \rightsquigarrow 0$. Notice that $F_n(0) = 1/2 \ne F(0) = 1$ so convergence fails at $t = 0$. That doesn't matter because $t = 0$ is not a continuity point of $F$ and the definition of convergence in distribution only requires convergence at continuity points. See Figure 5.1. Now consider convergence in probability. For any $\epsilon > 0$, using Markov's inequality,

$$\mathbb{P}(|X_n| > \epsilon) = \mathbb{P}(|X_n|^2 > \epsilon^2) \le \frac{\mathbb{E}(X_n^2)}{\epsilon^2} = \frac{1/n}{\epsilon^2} \to 0$$

as $n \to \infty$. Hence, $X_n \xrightarrow{P} 0$. ■


The  next  theorem  gives  the  relationship  between  the  types  of  convergence. 
The  results  are  summarized  in  Figure  5.2. 


5.4 Theorem. The following relationships hold:

(a) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{P} X$.

(b) $X_n \xrightarrow{P} X$ implies that $X_n \rightsquigarrow X$.

(c) If $X_n \rightsquigarrow X$ and if $\mathbb{P}(X = c) = 1$ for some real number $c$, then $X_n \xrightarrow{P} X$.

In general, none of the reverse implications hold except the special case in (c).

Proof. We start by proving (a). Suppose that $X_n \xrightarrow{qm} X$. Fix $\epsilon > 0$. Then, using Markov's inequality,

$$\mathbb{P}(|X_n - X| > \epsilon) = \mathbb{P}(|X_n - X|^2 > \epsilon^2) \le \frac{\mathbb{E}|X_n - X|^2}{\epsilon^2} \to 0.$$

Proof of (b). This proof is a little more complicated. You may skip it if you wish. Fix $\epsilon > 0$ and let $x$ be a continuity point of $F$. Then

$$F_n(x) = \mathbb{P}(X_n \le x) = \mathbb{P}(X_n \le x, X \le x + \epsilon) + \mathbb{P}(X_n \le x, X > x + \epsilon)$$
$$\le \mathbb{P}(X \le x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon)$$
$$= F(x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon).$$

Also,

$$F(x - \epsilon) = \mathbb{P}(X \le x - \epsilon) = \mathbb{P}(X \le x - \epsilon, X_n \le x) + \mathbb{P}(X \le x - \epsilon, X_n > x)$$
$$\le F_n(x) + \mathbb{P}(|X_n - X| > \epsilon).$$

Hence,

$$F(x - \epsilon) - \mathbb{P}(|X_n - X| > \epsilon) \le F_n(x) \le F(x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon).$$

Take the limit as $n \to \infty$ to conclude that

$$F(x - \epsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le F(x + \epsilon).$$

This holds for all $\epsilon > 0$. Take the limit as $\epsilon \to 0$ and use the fact that $F$ is continuous at $x$ and conclude that $\lim_n F_n(x) = F(x)$.

Proof of (c). Fix $\epsilon > 0$. Then,

$$\mathbb{P}(|X_n - c| > \epsilon) = \mathbb{P}(X_n < c - \epsilon) + \mathbb{P}(X_n > c + \epsilon)$$
$$\le \mathbb{P}(X_n \le c - \epsilon) + \mathbb{P}(X_n > c + \epsilon)$$
$$= F_n(c - \epsilon) + 1 - F_n(c + \epsilon)$$
$$\to F(c - \epsilon) + 1 - F(c + \epsilon)$$
$$= 0 + 1 - 1 = 0.$$
Let  us  now  show  that  the  reverse  implications  do  not  hold. 

Convergence in probability does not imply convergence in quadratic mean. Let $U \sim \text{Unif}(0, 1)$ and let $X_n = \sqrt{n}\,I_{(0, 1/n)}(U)$. Then $\mathbb{P}(|X_n| > \epsilon) = \mathbb{P}(\sqrt{n}\,I_{(0, 1/n)}(U) > \epsilon) = \mathbb{P}(0 \le U < 1/n) = 1/n \to 0$. Hence, $X_n \xrightarrow{P} 0$. But $\mathbb{E}(X_n^2) = n\int_0^{1/n} du = 1$ for all $n$ so $X_n$ does not converge in quadratic mean.

Convergence in distribution does not imply convergence in probability. Let $X \sim N(0, 1)$. Let $X_n = -X$ for $n = 1, 2, 3, \ldots$; hence $X_n \sim N(0, 1)$. $X_n$ has the same distribution function as $X$ for all $n$ so, trivially, $\lim_n F_n(x) = F(x)$ for all $x$. Therefore, $X_n \rightsquigarrow X$. But $\mathbb{P}(|X_n - X| > \epsilon) = \mathbb{P}(|2X| > \epsilon) = \mathbb{P}(|X| > \epsilon/2) \ne 0$. So $X_n$ does not converge to $X$ in probability. ■

Warning! One might conjecture that if $X_n \xrightarrow{P} b$, then $\mathbb{E}(X_n) \to b$. This is not true. (See footnote 1.) Let $X_n$ be a random variable defined by $\mathbb{P}(X_n = n^2) = 1/n$ and $\mathbb{P}(X_n = 0) = 1 - (1/n)$. Now, $\mathbb{P}(|X_n| < \epsilon) = \mathbb{P}(X_n = 0) = 1 - (1/n) \to 1$. Hence, $X_n \xrightarrow{P} 0$. However, $\mathbb{E}(X_n) = [n^2 \times (1/n)] + [0 \times (1 - (1/n))] = n$. Thus, $\mathbb{E}(X_n) \to \infty$.

Summary.  Stare  at  Figure  5.2. 

Some  convergence  properties  are  preserved  under  transformations. 

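5.5 Theorem. Let $X_n$, $X$, $Y_n$, $Y$ be random variables. Let $g$ be a continuous function.

(a) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$.

(b) If $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y$, then $X_n + Y_n \xrightarrow{qm} X + Y$.

(c) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n + Y_n \rightsquigarrow X + c$.

(d) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$.

(e) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n Y_n \rightsquigarrow cX$.

(f) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.

(g) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$.

Parts (c) and (e) are known as Slutzky's theorem. It is worth noting that $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ does not in general imply that $X_n + Y_n \rightsquigarrow X + Y$.

1 We can conclude that $\mathbb{E}(X_n) \to b$ if $X_n$ is uniformly integrable. See the appendix.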
5.3 The Law of Large Numbers

Now we come to a crowning achievement in probability, the law of large numbers. This theorem says that the mean of a large sample is close to the mean of the distribution. For example, the proportion of heads of a large number of tosses is expected to be close to 1/2. We now make this more precise.

Let $X_1, X_2, \ldots$ be an IID sample, let $\mu = \mathbb{E}(X_1)$ (see footnote 2) and $\sigma^2 = \mathbb{V}(X_1)$. Recall that the sample mean is defined as $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$ and that $\mathbb{E}(\overline{X}_n) = \mu$ and $\mathbb{V}(\overline{X}_n) = \sigma^2/n$.

5.6 Theorem (The Weak Law of Large Numbers (WLLN)). (See footnote 3.) If $X_1, \ldots, X_n$ are IID, then $\overline{X}_n \xrightarrow{P} \mu$.


Interpretation of the WLLN: The distribution of $\overline{X}_n$ becomes more concentrated around $\mu$ as $n$ gets large.


Proof. Assume that $\sigma < \infty$. This is not necessary but it simplifies the proof. Using Chebyshev's inequality,

$$\mathbb{P}(|\overline{X}_n - \mu| > \epsilon) \le \frac{\mathbb{V}(\overline{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$

which tends to 0 as $n \to \infty$. ■


5.7 Example. Consider flipping a coin for which the probability of heads is $p$. Let $X_i$ denote the outcome of a single toss (0 or 1). Hence, $p = \mathbb{P}(X_i = 1) = \mathbb{E}(X_i)$. The fraction of heads after $n$ tosses is $\overline{X}_n$. According to the law of large numbers, $\overline{X}_n$ converges to $p$ in probability. This does not mean that $\overline{X}_n$ will numerically equal $p$. It means that, when $n$ is large, the distribution of $\overline{X}_n$ is tightly concentrated around $p$. Suppose that $p = 1/2$. How large should $n$ be so that $\mathbb{P}(.4 \le \overline{X}_n \le .6) \ge .7$? First, $\mathbb{E}(\overline{X}_n) = p = 1/2$ and $\mathbb{V}(\overline{X}_n) = \sigma^2/n = p(1 - p)/n = 1/(4n)$. From Chebyshev's inequality,

$$\mathbb{P}(.4 \le \overline{X}_n \le .6) = \mathbb{P}(|\overline{X}_n - \mu| \le .1) = 1 - \mathbb{P}(|\overline{X}_n - \mu| > .1) \ge 1 - \frac{1}{4n(.1)^2} = 1 - \frac{25}{n}.$$

The last expression will be larger than .7 if $n = 84$. ■
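The sample-size calculation in Example 5.7 can be written as a short search. This sketch evaluates the Chebyshev lower bound $1 - p(1 - p)/(n \cdot 0.1^2)$ and looks for the smallest $n$ at which it exceeds the target; the function names and the linear search are illustrative choices, not the book's method.

```python
def chebyshev_lower_bound(n, p=0.5, half_width=0.1):
    """Chebyshev lower bound on P(|Xbar_n - p| <= half_width) for Bernoulli(p) tosses."""
    return 1.0 - p * (1.0 - p) / (n * half_width ** 2)


def smallest_n(target, p=0.5, half_width=0.1):
    """Smallest n whose Chebyshev lower bound exceeds `target` (simple linear search)."""
    n = 1
    while chebyshev_lower_bound(n, p, half_width) <= target:
        n += 1
    return n


print(smallest_n(0.7))
```

With the defaults $p = 1/2$ and half-width .1 the bound is $1 - 25/n$, and the search stops at the same $n = 84$ given in the example. Since Chebyshev's inequality is conservative, the true probability at that sample size is considerably higher than .7.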


2 Note that $\mu = \mathbb{E}(X_i)$ is the same for all $i$ so we can define $\mu = \mathbb{E}(X_i)$ for any $i$. By convention, we often write $\mu = \mathbb{E}(X_1)$.

3 There is a stronger theorem in the appendix called the strong law of large numbers.


5.4 The Central Limit Theorem

The law of large numbers says that the distribution of $\overline{X}_n$ piles up near $\mu$. This isn't enough to help us approximate probability statements about $\overline{X}_n$. For this we need the central limit theorem.

Suppose that $X_1, \ldots, X_n$ are IID with mean $\mu$ and variance $\sigma^2$. The central limit theorem (CLT) says that $\overline{X}_n = n^{-1}\sum_i X_i$ has a distribution which is approximately Normal with mean $\mu$ and variance $\sigma^2/n$. This is remarkable since nothing is assumed about the distribution of $X_i$, except the existence of the mean and variance.


5.8 Theorem (The Central Limit Theorem (CLT)). Let $X_1, \ldots, X_n$ be IID with mean $\mu$ and variance $\sigma^2$. Let $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then

$$Z_n \equiv \frac{\overline{X}_n - \mu}{\sqrt{\mathbb{V}(\overline{X}_n)}} = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightsquigarrow Z$$

where $Z \sim N(0, 1)$. In other words,

$$\lim_{n \to \infty} \mathbb{P}(Z_n \le z) = \Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx.$$


Interpretation:  Probability  statements  about  Xn  can  be  approximated 
using  a  Normal  distribution.  It’s  the  probability  statements  that  we 
are  approximating,  not  the  random  variable  itself. 

In addition to $Z_n \rightsquigarrow N(0,1)$, there are several forms of notation to denote
the  fact  that  the  distribution  of  Zn  is  converging  to  a  Normal.  They  all  mean 
the  same  thing.  Here  they  are: 
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the  number  of  errors  in  the  programs.  We  want  to  approximate  E(Xn  <  5.5). 
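Let $\mu = \mathbb{E}(X_1) = \lambda = 5$ and $\sigma^2 = \mathbb{V}(X_1) = \lambda = 5$. Then,

$$\mathbb{P}(\overline{X}_n < 5.5) = \mathbb{P}\left(\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} < \frac{\sqrt{n}(5.5 - \mu)}{\sigma}\right) \approx \mathbb{P}(Z < 2.5) = .9938.$$ ■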
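The Normal probability in this example can be reproduced with the error function from the Python standard library. In the sketch below the sample size n = 125 is an assumption, chosen because it gives the standardized value 2.5 reported above; the function names are illustrative.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Phi(z), the CDF of a standard Normal, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def clt_probability_below(threshold, mean, variance, n):
    """CLT approximation to P(Xbar_n < threshold) for IID data with the given mean and variance."""
    z = sqrt(n) * (threshold - mean) / sqrt(variance)
    return standard_normal_cdf(z)


# Poisson(5) counts: mean 5, variance 5; n = 125 is an assumed sample size.
print(round(clt_probability_below(5.5, mean=5.0, variance=5.0, n=125), 4))
```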


The central limit theorem tells us that $Z_n = \sqrt{n}(\overline{X}_n - \mu)/\sigma$ is approximately
$N(0,1)$. However, we rarely know $\sigma$. Later, we will see that we can estimate
$\sigma^2$ from $X_1, \ldots, X_n$ by

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2.$$

This raises the following question: if we replace $\sigma$ with $S_n$, is the central limit
theorem still true? The answer is yes.


5.10 Theorem. Assume the same conditions as the CLT. Then,

$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \rightsquigarrow N(0,1).$$


You might wonder how accurate the Normal approximation is. The answer
is given by the Berry-Esseen theorem.

5.11 Theorem (The Berry-Esseen Inequality). Suppose that $\mathbb{E}|X_1|^3 < \infty$. Then

$$\sup_z \left|\mathbb{P}(Z_n \le z) - \Phi(z)\right| \le \frac{33}{4}\, \frac{\mathbb{E}|X_1 - \mu|^3}{\sqrt{n}\,\sigma^3}. \qquad (5.4)$$
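To get a feel for the size of the bound in (5.4), it can be evaluated in closed form for Bernoulli($p$) data, where $\mathbb{E}|X_1 - p|^3 = p(1-p)^3 + (1-p)p^3$ and $\sigma^2 = p(1-p)$. The following is a small sketch under that assumption; the function name is illustrative.

```python
from math import sqrt


def berry_esseen_bound_bernoulli(p, n):
    """Right-hand side of (5.4) for IID Bernoulli(p) observations."""
    sigma = sqrt(p * (1 - p))
    # E|X - p|^3: X = 1 with probability p, X = 0 with probability 1 - p.
    third_abs_moment = p * (1 - p) ** 3 + (1 - p) * p ** 3
    return (33 / 4) * third_abs_moment / (sqrt(n) * sigma ** 3)


for n in (100, 10_000, 1_000_000):
    print(n, berry_esseen_bound_bernoulli(0.5, n))
```

For a fair coin the bound reduces to $8.25/\sqrt{n}$, so it only becomes informative for fairly large $n$; it is a worst-case guarantee, not an estimate of the actual error.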


There  is  also  a  multivariate  version  of  the  central  limit  theorem. 


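5.12 Theorem (Multivariate central limit theorem). Let $X_1, \ldots, X_n$ be IID
random vectors where

$$X_i = \begin{pmatrix} X_{1i} \\ X_{2i} \\ \vdots \\ X_{ki} \end{pmatrix}$$

with mean

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} \mathbb{E}(X_{1i}) \\ \mathbb{E}(X_{2i}) \\ \vdots \\ \mathbb{E}(X_{ki}) \end{pmatrix}$$

and variance matrix $\Sigma$. Let

$$\overline{X} = \begin{pmatrix} \overline{X}_1 \\ \overline{X}_2 \\ \vdots \\ \overline{X}_k \end{pmatrix}$$

where $\overline{X}_j = n^{-1}\sum_{i=1}^n X_{ji}$. Then,

$$\sqrt{n}(\overline{X} - \mu) \rightsquigarrow N(0, \Sigma).$$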


5.5  The  Delta  Method 

If  Yn  has  a  limiting  Normal  distribution  then  the  delta  method  allows  us  to 
find  the  limiting  distribution  of  g(Yn )  where  g  is  any  smooth  function. 
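In its standard univariate form (stated here as a reminder, under the assumption that $g$ is differentiable at $\mu$ with $g'(\mu) \neq 0$), the result reads as follows.

```latex
% Univariate delta method, standard statement.
% Assumption: g is differentiable at \mu and g'(\mu) \neq 0.
\[
\frac{\sqrt{n}\,(Y_n - \mu)}{\sigma} \rightsquigarrow N(0,1)
\quad\Longrightarrow\quad
\frac{\sqrt{n}\,\bigl(g(Y_n) - g(\mu)\bigr)}{|g'(\mu)|\,\sigma} \rightsquigarrow N(0,1).
\]
% Equivalently, in approximate form:
\[
Y_n \approx N\!\left(\mu, \frac{\sigma^2}{n}\right)
\quad\Longrightarrow\quad
g(Y_n) \approx N\!\left(g(\mu), \bigl(g'(\mu)\bigr)^2 \frac{\sigma^2}{n}\right).
\]
```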


5.14 Example. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu$ and finite variance
$\sigma^2$. By the central limit theorem, $\sqrt{n}(\overline{X}_n - \mu)/\sigma \rightsquigarrow N(0,1)$. Let $W_n = e^{\overline{X}_n}$.
Thus, $W_n = g(\overline{X}_n)$ where $g(s) = e^s$. Since $g'(s) = e^s$, the delta method
implies that $W_n \approx N(e^{\mu}, e^{2\mu}\sigma^2/n)$. ■

There  is  also  a  multivariate  version  of  the  delta  method. 

5.15 Theorem (The Multivariate Delta Method). Suppose that $Y_n = (Y_{n1}, \ldots, Y_{nk})$
is a sequence of random vectors such that

$$\sqrt{n}(Y_n - \mu) \rightsquigarrow N(0, \Sigma).$$
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Let  g  :  — )►  R  and  let 


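$$\nabla g(y) = \begin{pmatrix} \dfrac{\partial g}{\partial y_1} \\ \vdots \\ \dfrac{\partial g}{\partial y_k} \end{pmatrix}.$$

Let $\nabla_\mu$ denote $\nabla g(y)$ evaluated at $y = \mu$ and assume that the elements of $\nabla_\mu$
are nonzero. Then

$$\sqrt{n}\,(g(Y_n) - g(\mu)) \rightsquigarrow N\left(0, \nabla_\mu^T \Sigma \nabla_\mu\right).$$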


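5.16 Example. Let

$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}$$

be IID random vectors with mean $\mu = (\mu_1, \mu_2)^T$ and variance $\Sigma$. Let

$$\overline{X}_1 = \frac{1}{n}\sum_{i=1}^n X_{1i}, \qquad \overline{X}_2 = \frac{1}{n}\sum_{i=1}^n X_{2i}$$

and define $Y_n = \overline{X}_1 \overline{X}_2$. Thus, $Y_n = g(\overline{X}_1, \overline{X}_2)$ where $g(s_1, s_2) = s_1 s_2$. By the
central limit theorem,

$$\sqrt{n}\begin{pmatrix} \overline{X}_1 - \mu_1 \\ \overline{X}_2 - \mu_2 \end{pmatrix} \rightsquigarrow N(0, \Sigma).$$

Now

$$\nabla g(s) = \begin{pmatrix} \dfrac{\partial g}{\partial s_1} \\ \dfrac{\partial g}{\partial s_2} \end{pmatrix} = \begin{pmatrix} s_2 \\ s_1 \end{pmatrix}$$

and so

$$\nabla_\mu^T \Sigma \nabla_\mu = (\mu_2 \ \ \mu_1) \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \begin{pmatrix} \mu_2 \\ \mu_1 \end{pmatrix} = \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}.$$

Therefore,

$$\sqrt{n}\,(\overline{X}_1 \overline{X}_2 - \mu_1\mu_2) \rightsquigarrow N\left(0, \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}\right).$$ ■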
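The quadratic form in this example is easy to evaluate numerically. The sketch below is a plain-Python illustration with made-up values for $\mu$ and $\Sigma$ (they are assumptions for the demonstration, not values from the text); it computes $\nabla_\mu^T \Sigma \nabla_\mu$ for $g(s_1, s_2) = s_1 s_2$.

```python
def product_asymptotic_variance(mu, sigma):
    """Asymptotic variance of sqrt(n) * (Xbar_1 * Xbar_2 - mu_1 * mu_2).

    mu is (mu_1, mu_2); sigma is the 2x2 variance matrix as nested lists.
    """
    mu1, mu2 = mu
    grad = (mu2, mu1)  # gradient of g(s1, s2) = s1 * s2 evaluated at mu
    return sum(grad[i] * sigma[i][j] * grad[j] for i in range(2) for j in range(2))


# Hypothetical inputs chosen only for illustration.
mu = (1.0, 2.0)
sigma = [[1.0, 0.5],
         [0.5, 2.0]]
print(product_asymptotic_variance(mu, sigma))
```

The value agrees with the closed form $\mu_2^2\sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2\sigma_{22}$ derived above.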


5.6  Bibliographic  Remarks 

Convergence plays a central role in modern probability theory. For more
details, see Grimmett and Stirzaker (1982), Karr (1993), and Billingsley (1979).
Advanced convergence theory is explained in great detail in van der Vaart
and Wellner (1996) and van der Vaart (1998).
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5.7  Appendix 

5.7.1 Almost Sure and $L_1$ Convergence

We say that $X_n$ converges almost surely to $X$, written $X_n \xrightarrow{as} X$, if

$$\mathbb{P}(\{s : X_n(s) \to X(s)\}) = 1.$$

We say that $X_n$ converges in $L_1$ to $X$, written $X_n \xrightarrow{L_1} X$, if

$$\mathbb{E}|X_n - X| \to 0$$

as $n \to \infty$.


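5.17 Theorem. Let $X_n$ and $X$ be random variables. Then:

(a) $X_n \xrightarrow{as} X$ implies that $X_n \xrightarrow{P} X$.

(b) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{L_1} X$.

(c) $X_n \xrightarrow{L_1} X$ implies that $X_n \xrightarrow{P} X$.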


The weak law of large numbers says that $\overline{X}_n$ converges to $\mathbb{E}(X_1)$ in
probability. The strong law asserts that this is also true almost surely.


5.18 Theorem (The Strong Law of Large Numbers). Let $X_1, X_2, \ldots$ be IID. If
$\mathbb{E}|X_1| < \infty$ then $\overline{X}_n \xrightarrow{as} \mu$, where $\mu = \mathbb{E}(X_1)$.


A sequence $X_n$ is asymptotically uniformly integrable if

$$\lim_{M\to\infty} \limsup_{n\to\infty} \mathbb{E}\left(|X_n|\, I(|X_n| > M)\right) = 0.$$

5.19 Theorem. If $X_n \xrightarrow{P} b$ and $X_n$ is asymptotically uniformly integrable,
then $\mathbb{E}(X_n) \to b$.


5.7.2  Proof  of  the  Central  Limit  Theorem 

Recall that if $X$ is a random variable, its moment generating function (MGF)
is $\psi_X(t) = \mathbb{E}e^{tX}$. Assume in what follows that the MGF is finite in a
neighborhood around $t = 0$.

5.20 Lemma. Let $Z_1, Z_2, \ldots$ be a sequence of random variables. Let $\psi_n$ be the
MGF of $Z_n$. Let $Z$ be another random variable and denote its MGF by $\psi$. If
$\psi_n(t) \to \psi(t)$ for all $t$ in some open interval around 0, then $Z_n \rightsquigarrow Z$.




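PROOF OF THE CENTRAL LIMIT THEOREM. Let $Y_i = (X_i - \mu)/\sigma$. Then,
$Z_n = n^{-1/2}\sum_i Y_i$. Let $\psi(t)$ be the MGF of $Y_i$. The MGF of $\sum_i Y_i$ is $(\psi(t))^n$
and the MGF of $Z_n$ is $[\psi(t/\sqrt{n})]^n \equiv \xi_n(t)$. Now $\psi'(0) = \mathbb{E}(Y_1) = 0$ and
$\psi''(0) = \mathbb{E}(Y_1^2) = \mathbb{V}(Y_1) = 1$. So,

$$
\begin{aligned}
\psi(t) &= \psi(0) + t\psi'(0) + \frac{t^2}{2!}\psi''(0) + \frac{t^3}{3!}\psi'''(0) + \cdots \\
&= 1 + 0 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots \\
&= 1 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots
\end{aligned}
$$

Now,

$$
\begin{aligned}
\xi_n(t) &= \left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n \\
&= \left[1 + \frac{t^2}{2n} + \frac{t^3}{3!\,n^{3/2}}\psi'''(0) + \cdots\right]^n \\
&= \left[1 + \frac{\frac{t^2}{2} + \frac{t^3}{3!\,n^{1/2}}\psi'''(0) + \cdots}{n}\right]^n \\
&\to e^{t^2/2}
\end{aligned}
$$

which is the MGF of a $N(0,1)$. The result follows from Lemma 5.20.
In the last step we used the fact that if $a_n \to a$ then

$$\left(1 + \frac{a_n}{n}\right)^n \to e^a.$$ ■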


5.8  Exercises 

1. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu = \mathbb{E}(X_1)$ and finite variance
$\sigma^2 = \mathbb{V}(X_1)$. Let $\overline{X}_n$ be the sample mean and let $S_n^2$ be the sample
variance.

(a) Show that $\mathbb{E}(S_n^2) = \sigma^2$.

(b) Show that $S_n^2 \xrightarrow{P} \sigma^2$. Hint: Show that $S_n^2 = c_n n^{-1}\sum_{i=1}^n X_i^2 - d_n \overline{X}_n^2$
where $c_n \to 1$ and $d_n \to 1$. Apply the law of large numbers to $n^{-1}\sum_{i=1}^n X_i^2$
and to $\overline{X}_n$. Then use part (e) of Theorem 5.5.

2. Let $X_1, X_2, \ldots$ be a sequence of random variables. Show that $X_n \xrightarrow{qm} b$
if and only if

$$\lim_{n\to\infty} \mathbb{E}(X_n) = b \quad \text{and} \quad \lim_{n\to\infty} \mathbb{V}(X_n) = 0.$$


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3.  Let  Xi, . . . ,  Xn  be  HD  and  let  /i  =  E(Xi).  Suppose  that  the  variance  is 
finite. Show that $\overline{X}_n \xrightarrow{qm} \mu$.

4. Let $X_1, X_2, \ldots$ be a sequence of random variables such that

$$\mathbb{P}\left(X_n = \frac{1}{n}\right) = 1 - \frac{1}{n^2} \quad \text{and} \quad \mathbb{P}(X_n = n) = \frac{1}{n^2}.$$

Does $X_n$ converge in probability? Does $X_n$ converge in quadratic mean?


5. Let $X_1, \ldots, X_n \sim$ Bernoulli($p$). Prove that

$$\frac{1}{n}\sum_{i=1}^n X_i^2 \xrightarrow{P} p \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n X_i^2 \xrightarrow{qm} p.$$


6. Suppose that the height of men has mean 68 inches and standard
deviation 2.6 inches. We draw 100 men at random. Find (approximately)
the probability that the average height of men in our sample will be at
least 68 inches.


7. Let $\lambda_n = 1/n$ for $n = 1, 2, \ldots$. Let $X_n \sim$ Poisson($\lambda_n$).

(a) Show that $X_n \xrightarrow{P} 0$.

(b) Let $Y_n = nX_n$. Show that $Y_n \xrightarrow{P} 0$.


8. Suppose we have a computer program consisting of $n = 100$ pages of
code. Let $X_i$ be the number of errors on the $i$th page of code. Suppose
that the $X_i$'s are Poisson with mean 1 and that they are independent.
Let $Y = \sum_{i=1}^n X_i$ be the total number of errors. Use the central limit
theorem to approximate $\mathbb{P}(Y < 90)$.


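9. Suppose that $\mathbb{P}(X = 1) = \mathbb{P}(X = -1) = 1/2$. Define

$$X_n = \begin{cases} X & \text{with probability } 1 - \frac{1}{n} \\ e^n & \text{with probability } \frac{1}{n}. \end{cases}$$

Does $X_n$ converge to $X$ in probability? Does $X_n$ converge to $X$ in
distribution? Does $\mathbb{E}(X - X_n)^2$ converge to 0?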


10. Let $Z \sim N(0,1)$. Let $t > 0$. Show that, for any $k > 0$,

$$\mathbb{P}(|Z| > t) \le \frac{\mathbb{E}|Z|^k}{t^k}.$$

Compare this to Mill's inequality in Chapter 4.
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11.  Suppose  that  Xn  ~  X(0,l/n)  and  let  X  be  a  random  variable  with 
distribution  F(x)  =  0  if  x  <  0  and  F(x)  =  1  if  x  >  0.  Does  Xn  converge 
to  X  in  probability?  (Prove  or  disprove).  Does  Xn  converge  to  X  in 
distribution?  (Prove  or  disprove). 

12. Let $X, X_1, X_2, X_3, \ldots$ be random variables that are positive and integer
valued. Show that $X_n \rightsquigarrow X$ if and only if

$$\lim_{n\to\infty} \mathbb{P}(X_n = k) = \mathbb{P}(X = k)$$

for every integer $k$.

13. Let $Z_1, Z_2, \ldots$ be IID random variables with density $f$. Suppose that
$\mathbb{P}(Z_i > 0) = 1$ and that $\lambda = \lim_{x \downarrow 0} f(x) > 0$. Let

$$X_n = n \min\{Z_1, \ldots, Z_n\}.$$

Show that $X_n \rightsquigarrow Z$ where $Z$ has an exponential distribution with mean
$1/\lambda$.

14. Let $X_1, \ldots, X_n \sim$ Uniform(0, 1). Let $Y_n = \overline{X}_n^2$. Find the limiting
distribution of $Y_n$.

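15. Let

$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}$$

be IID random vectors with mean $\mu = (\mu_1, \mu_2)$ and variance $\Sigma$. Let

$$\overline{X}_1 = \frac{1}{n}\sum_{i=1}^n X_{1i}, \qquad \overline{X}_2 = \frac{1}{n}\sum_{i=1}^n X_{2i}$$

and define $Y_n = \overline{X}_1/\overline{X}_2$. Find the limiting distribution of $Y_n$.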

16. Construct an example where $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ but $X_n + Y_n$ does
not converge in distribution to $X + Y$.


Part  II 


Statistical  Inference 


6 

Models,  Statistical  Inference  and 
Learning 


6.1  Introduction 

Statistical  inference,  or  “learning”  as  it  is  called  in  computer  science,  is  the 
process  of  using  data  to  infer  the  distribution  that  generated  the  data.  A 
typical  statistical  inference  question  is: 

Given a sample $X_1, \ldots, X_n \sim F$, how do we infer $F$?

In  some  cases,  we  may  want  to  infer  only  some  feature  of  F  such  as  its 
mean. 


6.2  Parametric  and  Nonparametric  Models 

A statistical model $\mathfrak{F}$ is a set of distributions (or densities or regression
functions). A parametric model is a set $\mathfrak{F}$ that can be parameterized by a
finite number of parameters. For example, if we assume that the data come
from a Normal distribution, then the model is

$$\mathfrak{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\},\ \mu \in \mathbb{R},\ \sigma > 0 \right\}. \qquad (6.1)$$

This is a two-parameter model. We have written the density as $f(x; \mu, \sigma)$ to
show that $x$ is a value of the random variable whereas $\mu$ and $\sigma$ are parameters.
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In  general,  a  parametric  model  takes  the  form 


$$\mathfrak{F} = \left\{ f(x; \theta) : \theta \in \Theta \right\} \qquad (6.2)$$

where $\theta$ is an unknown parameter (or vector of parameters) that can take
values in the parameter space $\Theta$. If $\theta$ is a vector but we are only interested in
one component of $\theta$, we call the remaining parameters nuisance parameters.
A nonparametric model is a set $\mathfrak{F}$ that cannot be parameterized by a finite
number of parameters. For example, $\mathfrak{F}_{\text{ALL}} = \{\text{all CDFs}\}$ is nonparametric.$^1$

6.1 Example (One-dimensional Parametric Estimation). Let $X_1, \ldots, X_n$ be
independent Bernoulli($p$) observations. The problem is to estimate the
parameter $p$. ■

6.2 Example (Two-dimensional Parametric Estimation). Suppose that $X_1, \ldots, X_n \sim F$
and we assume that the PDF $f \in \mathfrak{F}$ where $\mathfrak{F}$ is given in (6.1). In
this case there are two parameters, $\mu$ and $\sigma$. The goal is to estimate the
parameters from the data. If we are only interested in estimating $\mu$, then $\mu$ is
the parameter of interest and $\sigma$ is a nuisance parameter. ■

6.3 Example (Nonparametric estimation of the CDF). Let $X_1, \ldots, X_n$ be
independent observations from a CDF $F$. The problem is to estimate $F$ assuming
only that $F \in \mathfrak{F}_{\text{ALL}} = \{\text{all CDFs}\}$. ■

6.4 Example (Nonparametric density estimation). Let $X_1, \ldots, X_n$ be
independent observations from a CDF $F$ and let $f = F'$ be the PDF. Suppose we want
to estimate the PDF $f$. It is not possible to estimate $f$ assuming only that
$F \in \mathfrak{F}_{\text{ALL}}$. We need to assume some smoothness on $f$. For example, we might
assume that $f \in \mathfrak{F} = \mathfrak{F}_{\text{DENS}} \cap \mathfrak{F}_{\text{SOB}}$ where $\mathfrak{F}_{\text{DENS}}$ is the set of all probability
density functions and

$$\mathfrak{F}_{\text{SOB}} = \left\{ f : \int (f''(x))^2\, dx < \infty \right\}.$$

The class $\mathfrak{F}_{\text{SOB}}$ is called a Sobolev space; it is the set of functions that are
not “too wiggly.” ■

6.5 Example (Nonparametric estimation of functionals). Let $X_1, \ldots, X_n \sim F$.
Suppose we want to estimate $\mu = \mathbb{E}(X_1) = \int x\, dF(x)$ assuming only that


1 The distinction between parametric and nonparametric is more subtle than this but we don’t
need a rigorous definition for our purposes.


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l±  exists.  The  mean  ji  may  be  thought  of  as  a  function  of  F:  we  can  write 
$\mu = T(F) = \int x\, dF(x)$. In general, any function of $F$ is called a statistical
functional. Other examples of functionals are the variance
$T(F) = \int x^2\, dF(x) - \left(\int x\, dF(x)\right)^2$ and the median $T(F) = F^{-1}(1/2)$. ■

6.6 Example (Regression, prediction, and classification). Suppose we observe pairs
of data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Perhaps $X_i$ is the blood pressure of subject $i$
and $Y_i$ is how long they live. $X$ is called a predictor or regressor or
feature or independent variable. $Y$ is called the outcome or the response
variable or the dependent variable. We call $r(x) = \mathbb{E}(Y \mid X = x)$ the
regression function. If we assume that $r \in \mathfrak{F}$ where $\mathfrak{F}$ is finite dimensional
(the set of straight lines, for example) then we have a parametric regression
model. If we assume that $r \in \mathfrak{F}$ where $\mathfrak{F}$ is not finite dimensional then
we have a nonparametric regression model. The goal of predicting $Y$ for
a new patient based on their $X$ value is called prediction. If $Y$ is discrete
(for example, live or die) then prediction is instead called classification. If
our goal is to estimate the function $r$, then we call this regression or curve
estimation. Regression models are sometimes written as

$$Y = r(X) + \epsilon \qquad (6.3)$$

where $\mathbb{E}(\epsilon) = 0$. We can always rewrite a regression model this way. To see
this, define $\epsilon = Y - r(X)$ and hence $Y = Y + r(X) - r(X) = r(X) + \epsilon$.
Moreover, $\mathbb{E}(\epsilon) = \mathbb{E}\,\mathbb{E}(\epsilon \mid X) = \mathbb{E}(\mathbb{E}(Y - r(X) \mid X)) = \mathbb{E}(\mathbb{E}(Y \mid X) - r(X)) =
\mathbb{E}(r(X) - r(X)) = 0$. ■

What’s Next? It is traditional in most introductory courses to start with
parametric inference. Instead, we will start with nonparametric inference and
then we will cover parametric inference. In some respects, nonparametric
inference is easier to understand and is more useful than parametric inference.

Frequentists  and  Bayesians.  There  are  many  approaches  to  statistical 
inference.  The  two  dominant  approaches  are  called  frequentist  inference 
and  Bayesian  inference.  We’ll  cover  both  but  we  will  start  with  frequentist 
inference.  We’ll  postpone  a  discussion  of  the  pros  and  cons  of  these  two  until 
later. 

Some Notation. If $\mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \}$ is a parametric model, we write
$\mathbb{P}_\theta(X \in A) = \int_A f(x; \theta)\, dx$ and $\mathbb{E}_\theta(r(X)) = \int r(x) f(x; \theta)\, dx$. The subscript $\theta$
indicates that the probability or expectation is with respect to $f(x; \theta)$; it does
not mean we are averaging over $\theta$. Similarly, we write $\mathbb{V}_\theta$ for the variance.
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6.3  Fundamental  Concepts  in  Inference 

Many inferential problems can be identified as being one of three types:
estimation, confidence sets, or hypothesis testing. We will treat all of these
problems in detail in the rest of the book. Here, we give a brief introduction
to the ideas.

6.3.1  Point  Estimation 

Point  estimation  refers  to  providing  a  single  “best  guess”  of  some  quantity 
of  interest.  The  quantity  of  interest  could  be  a  parameter  in  a  parametric 
model,  a  CDF  F,  a  probability  density  function  /,  a  regression  function  r,  or 
a  prediction  for  a  future  value  Y  of  some  random  variable. 

By convention, we denote a point estimate of $\theta$ by $\widehat{\theta}$ or $\widehat{\theta}_n$. Remember
that $\theta$ is a fixed, unknown quantity. The estimate $\widehat{\theta}$ depends on the
data so $\widehat{\theta}$ is a random variable.

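More formally, let $X_1, \ldots, X_n$ be $n$ IID data points from some distribution
$F$. A point estimator $\widehat{\theta}_n$ of a parameter $\theta$ is some function of $X_1, \ldots, X_n$:

$$\widehat{\theta}_n = g(X_1, \ldots, X_n).$$

The bias of an estimator is defined by

$$\text{bias}(\widehat{\theta}_n) = \mathbb{E}_\theta(\widehat{\theta}_n) - \theta. \qquad (6.4)$$

We say that $\widehat{\theta}_n$ is unbiased if $\mathbb{E}(\widehat{\theta}_n) = \theta$. Unbiasedness used to receive much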
attention  but  these  days  is  considered  less  important;  many  of  the  estimators 
we  will  use  are  biased.  A  reasonable  requirement  for  an  estimator  is  that  it 
should  converge  to  the  true  parameter  value  as  we  collect  more  and  more 
data.  This  requirement  is  quantified  by  the  following  definition: 

6.7 Definition. A point estimator $\widehat{\theta}_n$ of a parameter $\theta$ is consistent if
$\widehat{\theta}_n \xrightarrow{P} \theta$.

The distribution of $\widehat{\theta}_n$ is called the sampling distribution. The standard
deviation of $\widehat{\theta}_n$ is called the standard error, denoted by se:

$$\text{se} = \text{se}(\widehat{\theta}_n) = \sqrt{\mathbb{V}(\widehat{\theta}_n)}. \qquad (6.5)$$

Often, the standard error depends on the unknown $F$. In those cases, se is
an unknown quantity but we usually can estimate it. The estimated standard
error is denoted by $\widehat{\text{se}}$.
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6.8  Example.  Let  X\, . . . ,  Xn  ^  Bernoulli(p)  and  let  pn  =  n_1  JQ.  Then 
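$\mathbb{E}(\widehat{p}_n) = n^{-1}\sum_i \mathbb{E}(X_i) = p$ so $\widehat{p}_n$ is unbiased. The standard error is
$\text{se} = \sqrt{\mathbb{V}(\widehat{p}_n)} = \sqrt{p(1-p)/n}$. The estimated standard error is $\widehat{\text{se}} = \sqrt{\widehat{p}(1-\widehat{p})/n}$.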

■ 

The quality of a point estimate is sometimes assessed by the mean squared
error, or MSE, defined by

$$\text{MSE} = \mathbb{E}_\theta(\widehat{\theta}_n - \theta)^2. \qquad (6.6)$$

Keep in mind that $\mathbb{E}_\theta(\cdot)$ refers to expectation with respect to the distribution

$$f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta)$$

that generated the data. It does not mean we are averaging over a distribution
for $\theta$.

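6.9 Theorem. The MSE can be written as

$$\text{MSE} = \text{bias}^2(\widehat{\theta}_n) + \mathbb{V}_\theta(\widehat{\theta}_n). \qquad (6.7)$$

Proof. Let $\overline{\theta}_n = \mathbb{E}_\theta(\widehat{\theta}_n)$. Then

$$
\begin{aligned}
\mathbb{E}_\theta(\widehat{\theta}_n - \theta)^2 &= \mathbb{E}_\theta(\widehat{\theta}_n - \overline{\theta}_n + \overline{\theta}_n - \theta)^2 \\
&= \mathbb{E}_\theta(\widehat{\theta}_n - \overline{\theta}_n)^2 + 2(\overline{\theta}_n - \theta)\,\mathbb{E}_\theta(\widehat{\theta}_n - \overline{\theta}_n) + \mathbb{E}_\theta(\overline{\theta}_n - \theta)^2 \\
&= (\overline{\theta}_n - \theta)^2 + \mathbb{E}_\theta(\widehat{\theta}_n - \overline{\theta}_n)^2 \\
&= \text{bias}^2(\widehat{\theta}_n) + \mathbb{V}(\widehat{\theta}_n)
\end{aligned}
$$

where we have used the fact that $\mathbb{E}_\theta(\widehat{\theta}_n - \overline{\theta}_n) = \overline{\theta}_n - \overline{\theta}_n = 0$. ■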
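The decomposition can be verified exactly for a small discrete case. The sketch below enumerates the Binomial distribution of the number of heads and checks that the MSE equals squared bias plus variance. The biased estimator $(k+1)/(n+2)$ and the values $n = 20$, $p = 0.3$ are illustrative choices, not taken from the text.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(K = k) when K ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_moments(estimator, n, p):
    """Return (bias, variance, mse) of estimator(k, n), where k is the number of heads."""
    values = [estimator(k, n) for k in range(n + 1)]
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, probs))
    variance = sum((v - mean) ** 2 * w for v, w in zip(values, probs))
    mse = sum((v - p) ** 2 * w for v, w in zip(values, probs))
    return mean - p, variance, mse


def shrunk_proportion(k, n):
    """A deliberately biased estimator of p (hypothetical example)."""
    return (k + 1) / (n + 2)


bias, variance, mse = exact_moments(shrunk_proportion, n=20, p=0.3)
print(mse, bias ** 2 + variance)
```

The two printed numbers agree up to floating-point rounding, as (6.7) requires.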

6.10 Theorem. If bias $\to 0$ and se $\to 0$ as $n \to \infty$ then $\widehat{\theta}_n$ is consistent, that
is, $\widehat{\theta}_n \xrightarrow{P} \theta$.

Proof. If bias $\to 0$ and se $\to 0$ then, by Theorem 6.9, MSE $\to 0$. It
follows that $\widehat{\theta}_n \xrightarrow{qm} \theta$. (Recall Definition 5.2.) The result follows from part (b)
of Theorem 5.4. ■

6.11 Example. Returning to the coin flipping example, we have that $\mathbb{E}_p(\widehat{p}_n) = p$
so the bias $= p - p = 0$ and $\text{se} = \sqrt{p(1-p)/n} \to 0$. Hence, $\widehat{p}_n \xrightarrow{P} p$, that is,
$\widehat{p}_n$ is a consistent estimator. ■

Many of the estimators we will encounter will turn out to have,
approximately, a Normal distribution.
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6.12  Definition.  An  estimator  is  asymptotically  Normal  if 

$$\frac{\widehat{\theta}_n - \theta}{\text{se}} \rightsquigarrow N(0,1). \qquad (6.8)$$


6.3.2  Confidence  Sets 

A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$
where $a = a(X_1, \ldots, X_n)$ and $b = b(X_1, \ldots, X_n)$ are functions of the data
such that

$$\mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha, \quad \text{for all } \theta \in \Theta. \qquad (6.9)$$

In words, $(a, b)$ traps $\theta$ with probability $1 - \alpha$. We call $1 - \alpha$ the coverage of
the confidence interval.

Warning! $C_n$ is random and $\theta$ is fixed.

Commonly, people use 95 percent confidence intervals, which corresponds
to choosing $\alpha = 0.05$. If $\theta$ is a vector then we use a confidence set (such as
a sphere or an ellipse) instead of an interval.

Warning! There is much confusion about how to interpret a confidence
interval. A confidence interval is not a probability statement about $\theta$ since
$\theta$ is a fixed quantity, not a random variable. Some texts interpret confidence
intervals as follows: if I repeat the experiment over and over, the interval will
contain the parameter 95 percent of the time. This is correct but useless since
we rarely repeat the same experiment over and over. A better interpretation
is this:

On day 1, you collect data and construct a 95 percent confidence
interval for a parameter $\theta_1$. On day 2, you collect new data and
construct a 95 percent confidence interval for an unrelated parameter $\theta_2$.
On day 3, you collect new data and construct a 95 percent confidence
interval for an unrelated parameter $\theta_3$. You continue this way
constructing confidence intervals for a sequence of unrelated parameters
$\theta_1, \theta_2, \ldots$ Then 95 percent of your intervals will trap the true
parameter value. There is no need to introduce the idea of repeating
the same experiment over and over.

6.13  Example.  Every  day,  newspapers  report  opinion  polls.  For  example,  they 
might  say  that  “83  percent  of  the  population  favor  arming  pilots  with  guns.” 
Usually,  you  will  see  a  statement  like  “this  poll  is  accurate  to  within  4  points 
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95  percent  of  the  time.”  They  are  saying  that  83  ±4  is  a  95  percent  confidence 
interval  for  the  true  but  unknown  proportion  p  of  people  who  favor  arming 
pilots  with  guns.  If  you  form  a  confidence  interval  this  way  every  day  for  the 
rest  of  your  life,  95  percent  of  your  intervals  will  contain  the  true  parameter. 
This  is  true  even  though  you  are  estimating  a  different  quantity  (a  different 
poll  question)  every  day.  ■ 


6.14 Example. The fact that a confidence interval is not a probability
statement about $\theta$ is confusing. Consider this example from Berger and Wolpert
(1984). Let $\theta$ be a fixed, unknown real number and let $X_1, X_2$ be independent
random variables such that $\mathbb{P}(X_i = 1) = \mathbb{P}(X_i = -1) = 1/2$. Now define
$Y_i = \theta + X_i$ and suppose that you only observe $Y_1$ and $Y_2$. Define the
following “confidence interval” which actually only contains one point:

$$C = \begin{cases} \{Y_1 - 1\} & \text{if } Y_1 = Y_2 \\ \{(Y_1 + Y_2)/2\} & \text{if } Y_1 \neq Y_2. \end{cases}$$

You can check that, no matter what $\theta$ is, we have $\mathbb{P}_\theta(\theta \in C) = 3/4$ so this
is a 75 percent confidence interval. Suppose we now do the experiment and
we get $Y_1 = 15$ and $Y_2 = 17$. Then our 75 percent confidence interval is $\{16\}$.
However, we are certain that $\theta = 16$. If you wanted to make a probability
statement about $\theta$ you would probably say that $\mathbb{P}(\theta \in C \mid Y_1, Y_2) = 1$. There is
nothing wrong with saying that $\{16\}$ is a 75 percent confidence interval. But
it is not a probability statement about $\theta$. ■
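The 3/4 coverage claimed in this example can be confirmed by enumerating the four equally likely values of $(X_1, X_2)$. The sketch below does this with exact fractions; the function names and the choice $\theta = 16$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def confidence_set(y1, y2):
    """The one-point set C from the example."""
    if y1 == y2:
        return {y1 - 1}
    return {Fraction(y1 + y2, 2)}


def coverage(theta):
    """Exact probability that C contains theta, enumerating X1, X2 in {-1, +1}."""
    hits = 0
    for x1, x2 in product((-1, 1), repeat=2):
        y1, y2 = theta + x1, theta + x2
        if theta in confidence_set(y1, y2):
            hits += 1
    return Fraction(hits, 4)


print(coverage(16))            # coverage over repeated experiments
print(confidence_set(15, 17))  # the set reported for the observed data
```

The only outcome that misses $\theta$ is $X_1 = X_2 = -1$. Whenever $Y_1 \neq Y_2$ the set contains $\theta$ with certainty, which is the source of the tension described above.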


In Chapter 11 we will discuss Bayesian methods in which we treat $\theta$ as if it
were a random variable and we do make probability statements about $\theta$. In
particular, we will make statements like “the probability that $\theta$ is in $C_n$, given
the data, is 95 percent.” However, these Bayesian intervals refer to
degree-of-belief probabilities. These Bayesian intervals will not, in general, trap the
parameter 95 percent of the time.

6.15 Example. In the coin flipping setting, let $C_n = (\widehat{p}_n - \epsilon_n, \widehat{p}_n + \epsilon_n)$ where
$\epsilon_n^2 = \log(2/\alpha)/(2n)$. From Hoeffding’s inequality (4.4) it follows that

$$\mathbb{P}(p \in C_n) \ge 1 - \alpha$$

for every $p$. Hence, $C_n$ is a $1 - \alpha$ confidence interval. ■

As mentioned earlier, point estimators often have a limiting Normal
distribution, meaning that equation (6.8) holds, that is, $\widehat{\theta}_n \approx N(\theta, \widehat{\text{se}}^2)$. In this
case we can construct (approximate) confidence intervals as follows.
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6.16  Theorem  (Normal-based  Confidence  Interval).  Suppose  thatOn  zz  N(0,se  ). 
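Let $\Phi$ be the CDF of a standard Normal and let $z_{\alpha/2} = \Phi^{-1}(1 - (\alpha/2))$, that
is, $\mathbb{P}(Z > z_{\alpha/2}) = \alpha/2$ and $\mathbb{P}(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$ where $Z \sim N(0,1)$.
Let

$$C_n = \left(\widehat{\theta}_n - z_{\alpha/2}\, \widehat{\text{se}},\ \widehat{\theta}_n + z_{\alpha/2}\, \widehat{\text{se}}\right). \qquad (6.10)$$

Then

$$\mathbb{P}_\theta(\theta \in C_n) \to 1 - \alpha. \qquad (6.11)$$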

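Proof. Let $Z_n = (\widehat{\theta}_n - \theta)/\widehat{\text{se}}$. By assumption $Z_n \rightsquigarrow Z$ where $Z \sim N(0,1)$.
Hence,

$$
\begin{aligned}
\mathbb{P}_\theta(\theta \in C_n) &= \mathbb{P}_\theta\left(\widehat{\theta}_n - z_{\alpha/2}\, \widehat{\text{se}} < \theta < \widehat{\theta}_n + z_{\alpha/2}\, \widehat{\text{se}}\right) \\
&= \mathbb{P}_\theta\left(-z_{\alpha/2} < \frac{\widehat{\theta}_n - \theta}{\widehat{\text{se}}} < z_{\alpha/2}\right) \\
&\to \mathbb{P}\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) \\
&= 1 - \alpha.
\end{aligned}
$$ ■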


For 95 percent confidence intervals, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96 \approx 2$ leading
to the approximate 95 percent confidence interval $\widehat{\theta}_n \pm 2\, \widehat{\text{se}}$.

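6.17 Example. Let $X_1, \ldots, X_n \sim$ Bernoulli($p$) and let $\widehat{p}_n = n^{-1}\sum_{i=1}^n X_i$.
Then $\mathbb{V}(\widehat{p}_n) = n^{-2}\sum_{i=1}^n \mathbb{V}(X_i) = n^{-2}\sum_{i=1}^n p(1-p) = n^{-2}\, n p(1-p) = p(1-p)/n$.
Hence, $\text{se} = \sqrt{p(1-p)/n}$ and $\widehat{\text{se}} = \sqrt{\widehat{p}_n(1-\widehat{p}_n)/n}$. By the Central
Limit Theorem, $\widehat{p}_n \approx N(p, \widehat{\text{se}}^2)$. Therefore, an approximate $1 - \alpha$ confidence
interval is

$$\widehat{p}_n \pm z_{\alpha/2}\, \widehat{\text{se}} = \widehat{p}_n \pm z_{\alpha/2} \sqrt{\frac{\widehat{p}_n(1 - \widehat{p}_n)}{n}}.$$

Compare this with the confidence interval in Example 6.15. The Normal-based
interval is shorter but it only has approximately (large sample) correct
coverage. ■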
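The comparison between the two intervals can be made concrete. The sketch below computes the Hoeffding-based interval of Example 6.15 and the Normal-based interval of this example for the same data summary. It relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function names and the inputs $\widehat{p}_n = 0.5$, $n = 100$ are illustrative assumptions.

```python
from math import log, sqrt
from statistics import NormalDist


def hoeffding_interval(p_hat, n, alpha=0.05):
    """Interval p_hat +/- eps_n with eps_n^2 = log(2/alpha) / (2n)."""
    eps = sqrt(log(2 / alpha) / (2 * n))
    return p_hat - eps, p_hat + eps


def normal_interval(p_hat, n, alpha=0.05):
    """Interval p_hat +/- z_{alpha/2} * estimated standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se_hat = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se_hat, p_hat + z * se_hat


# Hypothetical data summary: 50 heads in 100 tosses.
print(hoeffding_interval(0.5, 100))
print(normal_interval(0.5, 100))
```

With these inputs the Hoeffding half-width is about 0.136 and the Normal half-width about 0.098, which illustrates the trade-off noted above: the shorter interval has only approximate coverage.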

6.3.3  Hypothesis  Testing 

In  hypothesis  testing,  we  start  with  some  default  theory  —  called  a  null 
hypothesis  —  and  we  ask  if  the  data  provide  sufficient  evidence  to  reject  the 
theory.  If  not  we  retain  the  null  hypothesis.  2 


2 The term “retaining the null hypothesis” is due to Chris Genovese. Other terminology is
“accepting the null” or “failing to reject the null.”
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6.18  Example  (Testing  if  a  Coin  is  Fair).  Let 

$$X_1, \ldots, X_n \sim \text{Bernoulli}(p)$$

be $n$ independent coin flips. Suppose we want to test if the coin is fair. Let $H_0$
denote the hypothesis that the coin is fair and let $H_1$ denote the hypothesis
that the coin is not fair. $H_0$ is called the null hypothesis and $H_1$ is called
the alternative hypothesis. We can write the hypotheses as

$$H_0 : p = 1/2 \quad \text{versus} \quad H_1 : p \neq 1/2.$$

It seems reasonable to reject $H_0$ if $T = |\widehat{p}_n - (1/2)|$ is large. When we discuss
hypothesis testing in detail, we will be more precise about how large $T$ should
be to reject $H_0$. ■


6.4  Bibliographic  Remarks 

Statistical inference is covered in many texts. Elementary texts include
DeGroot and Schervish (2002) and Larsen and Marx (1986). At the intermediate
level I recommend Casella and Berger (2002), Bickel and Doksum (2000), and
Rice (1995). At the advanced level, Cox and Hinkley (2000), Lehmann and
Casella (1998), Lehmann (1986), and van der Vaart (1998).


6.5  Appendix 

Our definition of confidence interval requires that $\mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha$
for all $\theta \in \Theta$. A pointwise asymptotic confidence interval requires that
$\liminf_{n\to\infty} \mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha$ for all $\theta \in \Theta$. A uniform asymptotic
confidence interval requires that $\liminf_{n\to\infty} \inf_{\theta \in \Theta} \mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha$. The
approximate Normal-based interval is a pointwise asymptotic confidence
interval.


6.6  Exercises 

1. Let $X_1, \ldots, X_n \sim$ Poisson($\lambda$) and let $\widehat{\lambda} = n^{-1}\sum_{i=1}^n X_i$. Find the bias,
se, and MSE of this estimator.

2. Let $X_1, \ldots, X_n \sim$ Uniform(0, $\theta$) and let $\widehat{\theta} = \max\{X_1, \ldots, X_n\}$. Find the
bias, se, and MSE of this estimator.
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3.  Let  Xi, . . .  ,  Xn  ^  Uniform(0,  9)  and  let  9  =  2Xn.  Find  the  bias,  se,  and 
MSE  of  this  estimator. 


7 

Estimating  the  CDF  and  Statistical 
Functionals 


The  first  inference  problem  we  will  consider  is  nonparametric  estimation  of  the 
CDF  F.  Then  we  will  estimate  statistical  functionals,  which  are  functions  of 
CDF,  such  as  the  mean,  the  variance,  and  the  correlation.  The  nonparametric 
method  for  estimating  functionals  is  called  the  plug-in  method. 


7.1  The  Empirical  Distribution  Function 

Let $X_1, \ldots, X_n \sim F$ be an IID sample where $F$ is a distribution function on
the real line. We will estimate $F$ with the empirical distribution function,
which is defined as follows.


7.1 Definition. The empirical distribution function $\widehat{F}_n$ is the CDF
that puts mass $1/n$ at each data point $X_i$. Formally,

$$\widehat{F}_n(x) = \frac{\sum_{i=1}^n I(X_i \le x)}{n} \qquad (7.1)$$

where

$$I(X_i \le x) = \begin{cases} 1 & \text{if } X_i \le x \\ 0 & \text{if } X_i > x. \end{cases}$$
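Definition (7.1) translates directly into code. The sketch below is one possible implementation, not the author's: it sorts the data once and counts the observations at or below $x$ with a binary search. The name `make_ecdf` and the toy data are illustrative assumptions.

```python
from bisect import bisect_right


def make_ecdf(data):
    """Return the empirical distribution function of the data as a callable."""
    xs = sorted(data)
    n = len(xs)

    def F_hat(x):
        # bisect_right counts the observations X_i with X_i <= x.
        return bisect_right(xs, x) / n

    return F_hat


F_hat = make_ecdf([3, 1, 2, 2])
print([F_hat(x) for x in (0, 1, 1.5, 2, 3)])
```

An estimate such as the fraction of observations in an interval $(a, b]$ is then `F_hat(b) - F_hat(a)`.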
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FIGURE  7.1.  Nerve  data.  Each  vertical  line  represents  one  data  point.  The  solid 
line  is  the  empirical  distribution  function.  The  lines  above  and  below  the  middle 
line  are  a  95  percent  confidence  band. 


7.2 Example (Nerve Data). Cox and Lewis (1966) reported 799 waiting times
between successive pulses along a nerve fiber. Figure 7.1 shows the empirical
CDF $\widehat{F}_n$. The data points are shown as small vertical lines at the bottom of
the plot. Suppose we want to estimate the fraction of waiting times between
.4 and .6 seconds. The estimate is $\widehat{F}_n(.6) - \widehat{F}_n(.4) = .93 - .84 = .09$. ■


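7.3 Theorem. At any fixed value of $x$,

$$
\begin{aligned}
\mathbb{E}\left(\widehat{F}_n(x)\right) &= F(x), \\
\mathbb{V}\left(\widehat{F}_n(x)\right) &= \frac{F(x)(1 - F(x))}{n}, \\
\text{MSE} &= \frac{F(x)(1 - F(x))}{n} \to 0, \\
\widehat{F}_n(x) &\xrightarrow{P} F(x).
\end{aligned}
$$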


7.4 Theorem (The Glivenko-Cantelli Theorem). Let $X_1, \ldots, X_n \sim F$. Then$^1$

$$\sup_x |\widehat{F}_n(x) - F(x)| \xrightarrow{P} 0.$$

Now we give an inequality that will be used to construct a confidence band.


1 More precisely, $\sup_x |\widehat{F}_n(x) - F(x)|$ converges to 0 almost surely.
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7.5  Theorem  (The  Dvoretzky-Kiefer-Wolfowitz  (DKW)  Inequality).  LetX i,  ... 
Xn  rsj  F.  Then ,  for  any  e  >  0, 


P  sup  | F(x)  —  Fn(x) |  >  e  <  2e 


—  2  ne" 


(7.2) 


From  the  DKW  inequality,  we  can  construct  a  confidence  set  as  follows: 


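A Nonparametric $1 - \alpha$ Confidence Band for $F$

Define,

$$\begin{aligned}
L(x) &= \max\{\hat{F}_n(x) - \epsilon_n,\ 0\} \\
U(x) &= \min\{\hat{F}_n(x) + \epsilon_n,\ 1\}
\end{aligned}$$

where

$$\epsilon_n = \sqrt{\frac{1}{2n} \log\left(\frac{2}{\alpha}\right)}.$$

It follows from (7.2) that for any $F$,

$$\mathbb{P}\Big(L(x) \le F(x) \le U(x) \ \text{ for all } x\Big) \ge 1 - \alpha. \tag{7.3}$$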


7.6 Example. The dashed lines in Figure 7.1 give a 95 percent confidence
band using $\epsilon_n = \sqrt{\frac{1}{2n} \log\left(\frac{2}{.05}\right)} = .048$. ■
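The band can be computed directly from the DKW half-width. The Python sketch below is one possible implementation; the function name `dkw_band` and the simulated sample are assumptions made for illustration. Only the sample size $n = 799$ is taken from the nerve data example, which is enough to reproduce $\epsilon_n$.

```python
import numpy as np

def dkw_band(data, alpha=0.05):
    """Return sorted data, lower and upper band values, and the half-width eps_n."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = xs.size
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    F_hat = np.arange(1, n + 1) / n          # F_hat_n at each order statistic (no ties assumed)
    lower = np.maximum(F_hat - eps, 0.0)     # L(x)
    upper = np.minimum(F_hat + eps, 1.0)     # U(x)
    return xs, lower, upper, eps

# Hypothetical sample with the same size as the nerve data (NOT the real data)
rng = np.random.default_rng(1)
sample = rng.exponential(scale=0.2, size=799)

xs, lower, upper, eps = dkw_band(sample)
print(round(float(eps), 3))   # 0.048, matching Example 7.6
```

Note that $\epsilon_n$ depends only on $n$ and $\alpha$, not on the data, so the band has the same vertical half-width everywhere except where it is clipped at 0 or 1.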


7.2  Statistical  Functionals 

A statistical functional $T(F)$ is any function of $F$. Examples are the mean
$\mu = \int x\, dF(x)$, the variance $\sigma^2 = \int (x - \mu)^2\, dF(x)$ and the median
$m = F^{-1}(1/2)$.

7.7 Definition. The plug-in estimator of $\theta = T(F)$ is defined by

$$\hat{\theta}_n = T(\hat{F}_n).$$

In other words, just plug in $\hat{F}_n$ for the unknown $F$.


7.8 Definition. If $T(F) = \int r(x)\, dF(x)$ for some function $r(x)$ then $T$ is
called a linear functional.

The reason $T(F) = \int r(x)\, dF(x)$ is called a linear functional is because $T$
satisfies

$$T(aF + bG) = aT(F) + bT(G),$$

hence $T$ is linear in its arguments. Recall that $\int r(x)\, dF(x)$ is defined to be
$\int r(x) f(x)\, dx$ in the continuous case and $\sum_j r(x_j) f(x_j)$ in the discrete. The
empirical CDF $\hat{F}_n(x)$ is discrete, putting mass $1/n$ at each $X_i$. Hence, if
$T(F) = \int r(x)\, dF(x)$ is a linear functional then we have:

$$T(\hat{F}_n) = \int r(x)\, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n r(X_i). \tag{7.4}$$


Sometimes we can find the estimated standard error $\widehat{\text{se}}$ of $T(\hat{F}_n)$ by doing
some calculations. However, in other cases it is not obvious how to estimate
the standard error. In the next chapter, we will discuss a general method for
finding $\widehat{\text{se}}$. For now, let us just assume that somehow we can find $\widehat{\text{se}}$.

In many cases, it turns out that

$$T(\hat{F}_n) \approx N(T(F), \widehat{\text{se}}^2). \tag{7.5}$$

By equation (6.11), an approximate $1 - \alpha$ confidence interval for $T(F)$ is then

$$T(\hat{F}_n) \pm z_{\alpha/2}\, \widehat{\text{se}}. \tag{7.6}$$

We will call this the Normal-based interval. For a 95 percent confidence
interval, $z_{\alpha/2} = z_{.05/2} = 1.96 \approx 2$ so the interval is

$$T(\hat{F}_n) \pm 2\, \widehat{\text{se}}.$$

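7.10 Example (The mean). Let $\mu = T(F) = \int x\, dF(x)$. The plug-in estimator
is $\hat{\mu} = \int x\, d\hat{F}_n(x) = \overline{X}_n$. The standard error is
$\text{se} = \sqrt{\mathbb{V}(\overline{X}_n)} = \sigma/\sqrt{n}$. If
$\hat{\sigma}$ denotes an estimate of $\sigma$, then the estimated standard error is $\hat{\sigma}/\sqrt{n}$. (In
the next example, we shall see how to estimate $\sigma$.) A Normal-based confidence
interval for $\mu$ is $\overline{X}_n \pm z_{\alpha/2}\, \widehat{\text{se}}$. ■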

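7.11 Example (The Variance). Let $\sigma^2 = T(F) = \mathbb{V}(X) = \int x^2\, dF(x) - \left(\int x\, dF(x)\right)^2$.
The plug-in estimator is

$$\begin{aligned}
\hat{\sigma}^2 &= \int x^2\, d\hat{F}_n(x) - \left(\int x\, d\hat{F}_n(x)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2.
\end{aligned}$$

Another reasonable estimator of $\sigma^2$ is the sample variance

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2.$$

In practice, there is little difference between $\hat{\sigma}^2$ and $S_n^2$ and you can use either
one. Returning to the last example, we now see that the estimated standard
error of the estimate of the mean is $\widehat{\text{se}} = \hat{\sigma}/\sqrt{n}$. ■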
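Examples 7.10 and 7.11 combine into a short computation. The Python sketch below is an illustration only: the function name `mean_normal_interval` and the eight-point data set are made up here, and the default `z = 1.96` corresponds to a 95 percent interval.

```python
import numpy as np

def mean_normal_interval(x, z=1.96):
    """Plug-in mean and variance, sample variance, and Normal-based interval for the mean."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)   # plug-in variance (divides by n)
    s2 = x.var(ddof=1)                        # sample variance (divides by n - 1)
    se_hat = np.sqrt(sigma2_hat / n)          # estimated standard error of the mean
    return mu_hat, sigma2_hat, s2, (mu_hat - z * se_hat, mu_hat + z * se_hat)

# Small made-up data set, chosen so the arithmetic is easy to check by hand
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu_hat, sigma2_hat, s2, ci = mean_normal_interval(x)
print(mu_hat, sigma2_hat, round(float(s2), 4), ci)
```

Here the squared deviations sum to 32, so the plug-in variance is $32/8 = 4$ while the sample variance is $32/7$; the gap shrinks as $n$ grows.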


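7.12 Example (The Skewness). Let $\mu$ and $\sigma^2$ denote the mean and variance
of a random variable $X$. The skewness is defined to be

$$\kappa = \frac{\mathbb{E}(X - \mu)^3}{\sigma^3} = \frac{\int (x - \mu)^3\, dF(x)}{\left\{\int (x - \mu)^2\, dF(x)\right\}^{3/2}}.$$

The skewness measures the lack of symmetry of a distribution. To find the
plug-in estimate, first recall that $\hat{\mu} = n^{-1} \sum_i X_i$ and $\hat{\sigma}^2 = n^{-1} \sum_i (X_i - \hat{\mu})^2$.
The plug-in estimate of $\kappa$ is

$$\hat{\kappa} = \frac{\int (x - \hat{\mu})^3\, d\hat{F}_n(x)}{\left\{\int (x - \hat{\mu})^2\, d\hat{F}_n(x)\right\}^{3/2}} = \frac{\frac{1}{n}\sum_i (X_i - \hat{\mu})^3}{\hat{\sigma}^3}. \quad ■$$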
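The plug-in skewness takes only a few lines of code. This Python sketch follows the formula in Example 7.12; the function name `plugin_skewness` and the two tiny data sets are assumptions for illustration.

```python
import numpy as np

def plugin_skewness(x):
    """Plug-in estimate: third central sample moment divided by sigma_hat cubed."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # plug-in sigma (divides by n)
    return np.mean((x - mu_hat) ** 3) / sigma_hat ** 3

symmetric = plugin_skewness([1.0, 2.0, 3.0, 4.0, 5.0])   # symmetric data
skewed = plugin_skewness([0.0, 0.0, 0.0, 10.0])          # one large value on the right
print(symmetric, round(float(skewed), 4))
```

Symmetric data give an estimate of zero, while a long right tail gives a positive value, in line with the interpretation of skewness as a measure of asymmetry.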


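7.13 Example (Correlation). Let $Z = (X, Y)$ and let
$\rho = T(F) = \mathbb{E}(X - \mu_X)(Y - \mu_Y)/(\sigma_x \sigma_y)$ denote the correlation between $X$ and $Y$, where $F(x, y)$
is bivariate. We can write

$$T(F) = a(T_1(F), T_2(F), T_3(F), T_4(F), T_5(F))$$

where

$$T_1(F) = \int x\, dF(z), \quad T_2(F) = \int y\, dF(z), \quad T_3(F) = \int xy\, dF(z),$$

$$T_4(F) = \int x^2\, dF(z), \quad T_5(F) = \int y^2\, dF(z),$$

and

$$a(t_1, \ldots, t_5) = \frac{t_3 - t_1 t_2}{\sqrt{(t_4 - t_1^2)(t_5 - t_2^2)}}.$$

Replace $F$ with $\hat{F}_n$ in $T_1(F), \ldots, T_5(F)$, and take

$$\hat{\rho} = a(T_1(\hat{F}_n), T_2(\hat{F}_n), T_3(\hat{F}_n), T_4(\hat{F}_n), T_5(\hat{F}_n)).$$

We get

$$\hat{\rho} = \frac{\sum_i (X_i - \overline{X}_n)(Y_i - \overline{Y}_n)}{\sqrt{\sum_i (X_i - \overline{X}_n)^2}\ \sqrt{\sum_i (Y_i - \overline{Y}_n)^2}}$$

which is called the sample correlation.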


7.14 Example (Quantiles). Let $F$ be strictly increasing with density $f$. For
$0 < p < 1$, the $p$th quantile is defined by $T(F) = F^{-1}(p)$. The estimate of
$T(F)$ is $\hat{F}_n^{-1}(p)$. We have to be a bit careful since $\hat{F}_n$ is not invertible. To
avoid ambiguity we define

$$\hat{F}_n^{-1}(p) = \inf\{x : \hat{F}_n(x) \ge p\}.$$

We call $T(\hat{F}_n) = \hat{F}_n^{-1}(p)$ the $p$th sample quantile. ■
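Because $\hat{F}_n$ jumps by $1/n$ at each order statistic, the infimum in this definition is attained at the $\lceil np \rceil$-th smallest observation. The following Python sketch implements that observation; the function name `sample_quantile` is an assumption, and library routines such as `numpy.quantile` use a different (interpolating) convention by default, so their answers can differ slightly.

```python
import math
import numpy as np

def sample_quantile(data, p):
    """Return inf{x : F_hat_n(x) >= p} for 0 < p < 1."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = xs.size
    k = math.ceil(n * p)      # smallest k with k/n >= p
    return xs[k - 1]          # k-th order statistic (arrays are 0-indexed)

data = [3.0, 1.0, 4.0, 2.0]
print(sample_quantile(data, 0.5), sample_quantile(data, 0.9))
```

With four observations, $p = 0.5$ gives $k = 2$ (the second smallest value) and $p = 0.9$ gives $k = 4$ (the largest value).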


Only  in  the  first  example  did  we  compute  a  standard  error  or  a  confidence 
interval.  How  shall  we  handle  the  other  examples?  When  we  discuss  parametric 
methods,  we  will  develop  formulas  for  standard  errors  and  confidence  intervals. 
But  in  our  nonparametric  setting  we  need  something  else.  In  the  next  chapter, 
we  will  introduce  the  bootstrap  for  getting  standard  errors  and  confidence 
intervals. 


7.15 Example (Plasma Cholesterol). Figure 7.2 shows histograms for plasma
cholesterol (in mg/dl) for 371 patients with chest pain (Scott et al. (1978)).
The histograms show the percentage of patients in 10 bins. The first histogram
is for 51 patients who had no evidence of heart disease while the second
histogram is for 320 patients who had narrowing of the arteries. Is the mean
cholesterol different in the two groups? Let us regard these data as samples
from two distributions $F_1$ and $F_2$. Let $\mu_1 = \int x\, dF_1(x)$ and $\mu_2 = \int x\, dF_2(x)$
denote the means of the two populations. The plug-in estimates are
$\hat{\mu}_1 = \int x\, d\hat{F}_{n,1}(x) = \overline{X}_{n,1} = 195.27$ and
$\hat{\mu}_2 = \int x\, d\hat{F}_{n,2}(x) = \overline{X}_{n,2} = 216.19$. Recall
that the standard error of the sample mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$ is

$$\text{se}(\hat{\mu}) = \sqrt{\mathbb{V}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)} = \frac{\sigma}{\sqrt{n}}$$

which we estimate by

$$\widehat{\text{se}}(\hat{\mu}) = \frac{\hat{\sigma}}{\sqrt{n}}$$

where

$$\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2}.$$

For the two groups this yields $\widehat{\text{se}}(\hat{\mu}_1) = 5.0$ and $\widehat{\text{se}}(\hat{\mu}_2) = 2.4$. Approximate 95
percent confidence intervals for $\mu_1$ and $\mu_2$ are $\hat{\mu}_1 \pm 2\,\widehat{\text{se}}(\hat{\mu}_1) = (185, 205)$ and
$\hat{\mu}_2 \pm 2\,\widehat{\text{se}}(\hat{\mu}_2) = (211, 221)$.

Now, consider the functional $\theta = T(F_2) - T(F_1)$ whose plug-in estimate is
$\hat{\theta} = \hat{\mu}_2 - \hat{\mu}_1 = 216.19 - 195.27 = 20.92$. The standard error of $\hat{\theta}$ is

$$\text{se} = \sqrt{\mathbb{V}(\hat{\mu}_2 - \hat{\mu}_1)} = \sqrt{\mathbb{V}(\hat{\mu}_2) + \mathbb{V}(\hat{\mu}_1)} = \sqrt{(\text{se}(\hat{\mu}_1))^2 + (\text{se}(\hat{\mu}_2))^2}$$

and we estimate this by

$$\widehat{\text{se}} = \sqrt{(\widehat{\text{se}}(\hat{\mu}_1))^2 + (\widehat{\text{se}}(\hat{\mu}_2))^2} = 5.55.$$

An approximate 95 percent confidence interval for $\theta$ is $\hat{\theta} \pm 2\,\widehat{\text{se}}(\hat{\theta}_n) = (9.8, 32.0)$.
This suggests that cholesterol is higher among those with narrowed arteries.
We should not jump to the conclusion (from these data) that cholesterol causes
heart disease. The leap from statistical evidence to causation is very subtle
and is discussed in Chapter 16. ■
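The arithmetic in this example is easy to check. The Python sketch below starts from the summary numbers quoted in the text (the two means and their estimated standard errors), since the raw cholesterol data are not reproduced here; the variable names are illustrative.

```python
import math

# Summary statistics quoted in Example 7.15
mu1_hat, se1 = 195.27, 5.0    # no evidence of heart disease (51 patients)
mu2_hat, se2 = 216.19, 2.4    # narrowing of the arteries (320 patients)

theta_hat = mu2_hat - mu1_hat                 # plug-in estimate of the difference
se_hat = math.sqrt(se1 ** 2 + se2 ** 2)       # independent samples: variances add
ci = (theta_hat - 2 * se_hat, theta_hat + 2 * se_hat)

print(round(theta_hat, 2), round(se_hat, 2), tuple(round(c, 1) for c in ci))
# prints: 20.92 5.55 (9.8, 32.0)
```

The variances add because the two groups are treated as independent samples, so no covariance term appears.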


FIGURE 7.2. Plasma cholesterol for 51 patients with no heart disease and 320
patients with narrowing of the arteries. [Two histograms with a horizontal axis
from 100 to 400: plasma cholesterol for patients without heart disease, and
plasma cholesterol for patients with heart disease.]

7.3 Bibliographic Remarks

The Glivenko-Cantelli theorem is the tip of the iceberg. The theory of
distribution functions is a special case of what are called empirical processes
which underlie much of modern statistical theory. Some references on empirical
processes are Shorack and Wellner (1986) and van der Vaart and Wellner
(1996).

7.4  Exercises 

1.  Prove  Theorem  7.3. 

2. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $Y_1, \ldots, Y_m \sim \text{Bernoulli}(q)$. Find
the plug-in estimator and estimated standard error for $p$. Find an
approximate 90 percent confidence interval for $p$. Find the plug-in
estimator and estimated standard error for $p - q$. Find an approximate 90
percent confidence interval for $p - q$.

3. (Computer Experiment.) Generate 100 observations from a $N(0,1)$
distribution. Compute a 95 percent confidence band for the CDF $F$ (as
described in the appendix). Repeat this 1000 times and see how often
the confidence band contains the true distribution function. Repeat
using data from a Cauchy distribution.

4. Let $X_1, \ldots, X_n \sim F$ and let $\hat{F}_n(x)$ be the empirical distribution
function. For a fixed $x$, use the central limit theorem to find the limiting
distribution of $\hat{F}_n(x)$.

5. Let $x$ and $y$ be two distinct points. Find $\text{Cov}(\hat{F}_n(x), \hat{F}_n(y))$.

6. Let $X_1, \ldots, X_n \sim F$ and let $\hat{F}_n$ be the empirical distribution function.
Let $a < b$ be fixed numbers and define $\theta = T(F) = F(b) - F(a)$. Let
$\hat{\theta} = T(\hat{F}_n) = \hat{F}_n(b) - \hat{F}_n(a)$. Find the estimated standard error of $\hat{\theta}$.
Find an expression for an approximate $1 - \alpha$ confidence interval for $\theta$.

7. Data on the magnitudes of earthquakes near Fiji are available on the
website for this book. Estimate the CDF $F(x)$. Compute and plot a 95
percent confidence envelope for $F$ (as described in the appendix). Find
an approximate 95 percent confidence interval for $F(4.9) - F(4.3)$.

8. Get the data on eruption times and waiting times between eruptions of
the Old Faithful geyser from the website. Estimate the mean waiting
time and give a standard error for the estimate. Also, give a 90 percent
confidence interval for the mean waiting time. Now estimate the median
waiting time. In the next chapter we will see how to get the standard
error for the median.

9. 100 people are given a standard antibiotic to treat an infection and
another 100 are given a new antibiotic. In the first group, 90 people
recover; in the second group, 85 people recover. Let $p_1$ be the probability
of recovery under the standard treatment and let $p_2$ be the probability of
recovery under the new treatment. We are interested in estimating
$\theta = p_1 - p_2$. Provide an estimate, standard error, an 80 percent confidence
interval, and a 95 percent confidence interval for $\theta$.

10.  In  1975,  an  experiment  was  conducted  to  see  if  cloud  seeding  produced 
rainfall.  26  clouds  were  seeded  with  silver  nitrate  and  26  were  not.  The 
decision  to  seed  or  not  was  made  at  random.  Get  the  data  from 

http://lib.stat.cmu.edu/DASL/Stories/CloudSeeding.html 

Let $\theta$ be the difference in the mean precipitation from the two groups.
Estimate $\theta$. Estimate the standard error of the estimate and produce a
95 percent confidence interval.


8 

The  Bootstrap 


The bootstrap is a method for estimating standard errors and computing
confidence intervals. Let $T_n = g(X_1, \ldots, X_n)$ be a statistic, that is, $T_n$ is any
function of the data. Suppose we want to know $\mathbb{V}_F(T_n)$, the variance of $T_n$.
We have written $\mathbb{V}_F$ to emphasize that the variance usually depends on the
unknown distribution function $F$. For example, if $T_n = \overline{X}_n$ then
$\mathbb{V}_F(T_n) = \sigma^2/n$ where $\sigma^2 = \int (x - \mu)^2\, dF(x)$ and $\mu = \int x\, dF(x)$. Thus the variance of $T_n$
is a function of $F$. The bootstrap idea has two steps:

Step 1: Estimate $\mathbb{V}_F(T_n)$ with $\mathbb{V}_{\hat{F}_n}(T_n)$.

Step 2: Approximate $\mathbb{V}_{\hat{F}_n}(T_n)$ using simulation.

For $T_n = \overline{X}_n$, we have for Step 1 that $\mathbb{V}_{\hat{F}_n}(T_n) = \hat{\sigma}^2/n$ where
$\hat{\sigma}^2 = n^{-1} \sum_{i=1}^n (X_i - \overline{X}_n)^2$. In this case, Step 1 is enough. However, in more complicated cases we
cannot write down a simple formula for $\mathbb{V}_{\hat{F}_n}(T_n)$ which is why we need Step
2. Before proceeding, let us discuss the idea of simulation.

8.1 Simulation

Suppose we draw an IID sample $Y_1, \ldots, Y_B$ from a distribution $G$. By the law
of large numbers,

$$\overline{Y}_n = \frac{1}{B} \sum_{j=1}^B Y_j \xrightarrow{P} \int y\, dG(y) = \mathbb{E}(Y)$$

as $B \to \infty$. So if we draw a large sample from $G$, we can use the sample
mean $\overline{Y}_n$ to approximate $\mathbb{E}(Y)$. In a simulation, we can make $B$ as large as
we like, in which case, the difference between $\overline{Y}_n$ and $\mathbb{E}(Y)$ is negligible. More
generally, if $h$ is any function with finite mean then

$$\frac{1}{B} \sum_{j=1}^B h(Y_j) \xrightarrow{P} \int h(y)\, dG(y) = \mathbb{E}(h(Y))$$

as $B \to \infty$. In particular,

$$\frac{1}{B} \sum_{j=1}^B (Y_j - \overline{Y})^2 = \frac{1}{B} \sum_{j=1}^B Y_j^2 - \left(\frac{1}{B} \sum_{j=1}^B Y_j\right)^2 \xrightarrow{P} \int y^2\, dG(y) - \left(\int y\, dG(y)\right)^2 = \mathbb{V}(Y).$$

Hence, we can use the sample variance of the simulated values to approximate
$\mathbb{V}(Y)$.
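To make the idea concrete, here is a small Python sketch of these three approximations. The choice of $G$ as an Exponential distribution with mean 2 is an assumption made purely for illustration, picked because $\mathbb{E}(Y) = 2$, $\mathbb{E}(Y^2) = 8$ and $\mathbb{V}(Y) = 4$ are known and can be compared with the simulated values.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 200_000                                  # simulation size; larger B means smaller error
y = rng.exponential(scale=2.0, size=B)       # Y_1, ..., Y_B drawn IID from G (assumed Exponential)

mean_approx = y.mean()                       # approximates E(Y) = 2
h_approx = np.mean(y ** 2)                   # h(y) = y^2, approximates E(Y^2) = 8
var_approx = np.mean((y - y.mean()) ** 2)    # approximates V(Y) = 4

print(round(float(mean_approx), 3), round(float(h_approx), 3), round(float(var_approx), 3))
```

The simulation error shrinks at rate $1/\sqrt{B}$, which is why $B$ can always be chosen large enough to make it negligible compared with the statistical error discussed in the next section.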


8.2  Bootstrap  Variance  Estimation 

According to what we just learned, we can approximate $\mathbb{V}_{\hat{F}_n}(T_n)$ by
simulation. Now $\mathbb{V}_{\hat{F}_n}(T_n)$ means "the variance of $T_n$ if the distribution of the data
is $\hat{F}_n$." How can we simulate from the distribution of $T_n$ when the data are
assumed to have distribution $\hat{F}_n$? The answer is to simulate $X_1^*, \ldots, X_n^*$ from
$\hat{F}_n$ and then compute $T_n^* = g(X_1^*, \ldots, X_n^*)$. This constitutes one draw from
the distribution of $T_n$. The idea is illustrated in the following diagram:

Real world        $F \Longrightarrow X_1, \ldots, X_n \Longrightarrow T_n = g(X_1, \ldots, X_n)$

Bootstrap world   $\hat{F}_n \Longrightarrow X_1^*, \ldots, X_n^* \Longrightarrow T_n^* = g(X_1^*, \ldots, X_n^*)$

How do we simulate $X_1^*, \ldots, X_n^*$ from $\hat{F}_n$? Notice that $\hat{F}_n$ puts mass $1/n$ at
each data point $X_1, \ldots, X_n$. Therefore,

drawing an observation from $\hat{F}_n$ is equivalent to drawing
one point at random from the original data set.

Thus, to simulate $X_1^*, \ldots, X_n^* \sim \hat{F}_n$ it suffices to draw $n$ observations with
replacement from $X_1, \ldots, X_n$. Here is a summary:


Bootstrap Variance Estimation

1. Draw $X_1^*, \ldots, X_n^* \sim \hat{F}_n$.

2. Compute $T_n^* = g(X_1^*, \ldots, X_n^*)$.

3. Repeat steps 1 and 2, $B$ times, to get $T_{n,1}^*, \ldots, T_{n,B}^*$.

4. Let

$$v_{boot} = \frac{1}{B} \sum_{b=1}^B \left( T_{n,b}^* - \frac{1}{B} \sum_{r=1}^B T_{n,r}^* \right)^2. \tag{8.1}$$


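8.1 Example. The following R code shows how to use the bootstrap to
estimate the standard error of the median.

Bootstrap for The Median

    # Given data X = (X(1), ..., X(n)):
    n <- length(X)
    B <- 1000                # number of bootstrap replications (an assumed choice)
    Tn <- median(X)          # the statistic computed on the original data
    Tboot <- numeric(B)      # vector of length B
    for (i in 1:B) {
      # sample of size n from X (with replacement)
      Xstar <- sample(X, size = n, replace = TRUE)
      Tboot[i] <- median(Xstar)
    }
    se <- sqrt(var(Tboot))   # bootstrap estimate of the standard error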

The following schematic diagram will remind you that we are using two
approximations:

$$\mathbb{V}_F(T_n) \ \overset{\text{not so small}}{\approx} \ \mathbb{V}_{\hat{F}_n}(T_n) \ \overset{\text{small}}{\approx} \ v_{boot}.$$
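For the sample mean all three quantities in the schematic can be computed, which makes the two sources of error visible. The Python sketch below uses simulated $N(0,1)$ data so that $\mathbb{V}_F(T_n) = 1/n$ is known; the data, the sample size and the value of $B$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)      # simulated data; true F is N(0, 1)
n = x.size

v_true = 1.0 / n                                 # V_F(T_n) = sigma^2 / n (known only because we chose F)
v_plugin = np.mean((x - x.mean()) ** 2) / n      # V_{F_hat_n}(T_n) = sigma_hat^2 / n, exact for the mean

B = 10_000
idx = rng.integers(0, n, size=(B, n))            # B resamples of size n, drawn with replacement
t_boot = x[idx].mean(axis=1)                     # T*_{n,1}, ..., T*_{n,B}
v_boot = np.mean((t_boot - t_boot.mean()) ** 2)  # equation (8.1)

print(v_true, round(float(v_plugin), 5), round(float(v_boot), 5))
```

The gap between `v_plugin` and `v_boot` is pure simulation error and can be driven down by increasing $B$; the gap between `v_true` and `v_plugin` depends on $n$ and cannot be reduced by more simulation.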


8.2 Example. Consider the nerve data. Let $\theta = T(F) = \int (x - \mu)^3\, dF(x)/\sigma^3$ be
the skewness. The skewness is a measure of asymmetry. A Normal distribution,
for example, has skewness 0. The plug-in estimate of the skewness is

$$\hat{\theta} = T(\hat{F}_n) = \frac{\int (x - \mu)^3\, d\hat{F}_n(x)}{\hat{\sigma}^3} = \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^3}{\hat{\sigma}^3}.$$

To estimate the standard error with the bootstrap we follow the same steps
as with the median example except we compute the skewness from each
bootstrap sample. When applied to the nerve data, the bootstrap, based on
$B = 1{,}000$ replications, yields a standard error for the estimated skewness of
.16. ■

8.3  Bootstrap  Confidence  Intervals 

There  are  several  ways  to  construct  bootstrap  confidence  intervals.  Here  we 
discuss  three  methods. 

Method 1: The Normal Interval. The simplest method is the Normal interval

$$T_n \pm z_{\alpha/2}\, \widehat{\text{se}}_{boot} \tag{8.2}$$

where $\widehat{\text{se}}_{boot} = \sqrt{v_{boot}}$ is the bootstrap estimate of the standard error. This
interval is not accurate unless the distribution of $T_n$ is close to Normal.

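Method 2: Pivotal Intervals. Let $\theta = T(F)$ and $\hat{\theta}_n = T(\hat{F}_n)$ and define the
pivot $R_n = \hat{\theta}_n - \theta$. Let $\hat{\theta}_{n,1}^*, \ldots, \hat{\theta}_{n,B}^*$ denote bootstrap replications of $\hat{\theta}_n$. Let
$H(r)$ denote the CDF of the pivot:

$$H(r) = \mathbb{P}_F(R_n \le r). \tag{8.3}$$

Define $C_n^\star = (a, b)$ where

$$a = \hat{\theta}_n - H^{-1}\left(1 - \frac{\alpha}{2}\right) \quad \text{and} \quad b = \hat{\theta}_n - H^{-1}\left(\frac{\alpha}{2}\right).$$

It follows that

$$\begin{aligned}
\mathbb{P}(a \le \theta \le b) &= \mathbb{P}(a - \hat{\theta}_n \le \theta - \hat{\theta}_n \le b - \hat{\theta}_n) \\
&= \mathbb{P}(\hat{\theta}_n - b \le \hat{\theta}_n - \theta \le \hat{\theta}_n - a) \\
&= \mathbb{P}(\hat{\theta}_n - b \le R_n \le \hat{\theta}_n - a) \\
&= H(\hat{\theta}_n - a) - H(\hat{\theta}_n - b) \\
&= H\left(H^{-1}\left(1 - \frac{\alpha}{2}\right)\right) - H\left(H^{-1}\left(\frac{\alpha}{2}\right)\right) \\
&= 1 - \frac{\alpha}{2} - \frac{\alpha}{2} = 1 - \alpha.
\end{aligned}$$

Hence, $C_n^\star$ is an exact $1 - \alpha$ confidence interval for $\theta$. Unfortunately, $a$ and $b$
depend on the unknown distribution $H$ but we can form a bootstrap estimate
of $H$:

$$\hat{H}(r) = \frac{1}{B} \sum_{b=1}^B I(R_{n,b}^* \le r) \tag{8.5}$$

where $R_{n,b}^* = \hat{\theta}_{n,b}^* - \hat{\theta}_n$. Let $r_\beta^*$ denote the $\beta$ sample quantile of $(R_{n,1}^*, \ldots, R_{n,B}^*)$
and let $\theta_\beta^*$ denote the $\beta$ sample quantile of $(\hat{\theta}_{n,1}^*, \ldots, \hat{\theta}_{n,B}^*)$. Note that
$r_\beta^* = \theta_\beta^* - \hat{\theta}_n$. It follows that an approximate $1 - \alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$
where

$$\begin{aligned}
\hat{a} &= \hat{\theta}_n - \hat{H}^{-1}\left(1 - \frac{\alpha}{2}\right) = \hat{\theta}_n - r_{1-\alpha/2}^* = 2\hat{\theta}_n - \theta_{1-\alpha/2}^* \\
\hat{b} &= \hat{\theta}_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat{\theta}_n - r_{\alpha/2}^* = 2\hat{\theta}_n - \theta_{\alpha/2}^*.
\end{aligned}$$

In summary, the $1 - \alpha$ bootstrap pivotal confidence interval is

$$C_n = \left(2\hat{\theta}_n - \theta_{1-\alpha/2}^*,\ 2\hat{\theta}_n - \theta_{\alpha/2}^*\right). \tag{8.6}$$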


8.3 Theorem. Under weak conditions on $T(F)$,

$$\mathbb{P}_F(T(F) \in C_n) \to 1 - \alpha$$

as $n \to \infty$, where $C_n$ is given in (8.6).

Method 3: Percentile Intervals. The bootstrap percentile interval is
defined by

$$C_n = \left(\theta_{\alpha/2}^*,\ \theta_{1-\alpha/2}^*\right).$$

The justification for this interval is given in the appendix.
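The three methods differ only in how they use the bootstrap replications, as the following Python sketch tries to show. The function name `bootstrap_intervals`, its defaults, and the simulated data are assumptions made for illustration. The sketch uses `numpy.quantile`, whose default interpolation differs slightly from the sample quantile defined in Chapter 7.

```python
from statistics import NormalDist
import numpy as np

def bootstrap_intervals(x, stat, B=2000, alpha=0.05, seed=0):
    """Normal, pivotal and percentile bootstrap intervals for stat(x)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    theta_hat = stat(x)
    # Each replication: resample n points with replacement, recompute the statistic
    t_boot = np.array([stat(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    se_boot = t_boot.std()                                # square root of v_boot in (8.1)
    z = NormalDist().inv_cdf(1 - alpha / 2)               # z_{alpha/2}
    q_lo, q_hi = np.quantile(t_boot, [alpha / 2, 1 - alpha / 2])
    return {
        "normal": (theta_hat - z * se_boot, theta_hat + z * se_boot),   # (8.2)
        "pivotal": (2 * theta_hat - q_hi, 2 * theta_hat - q_lo),        # (8.6)
        "percentile": (q_lo, q_hi),
    }

# Synthetic, skewed data; the statistic here is the median
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100)
intervals = bootstrap_intervals(x, np.median)
for name, (lo, hi) in intervals.items():
    print(f"{name:10s} ({lo:.3f}, {hi:.3f})")
```

The pivotal interval is the percentile interval reflected about $\hat{\theta}_n$, so the two always have the same width; they coincide only when the bootstrap distribution is symmetric about $\hat{\theta}_n$.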

8.4  Example.  For  estimating  the  skewness  of  the  nerve  data,  here  are  the 
various  confidence  intervals. 

Method        95% Interval
Normal        (1.44, 2.09)
Pivotal       (1.48, 2.11)
Percentile    (1.42, 2.03)

All these confidence intervals are approximate. The probability that $T(F)$
is in the interval is not exactly $1 - \alpha$. All three intervals have the same level
of accuracy. There are more accurate bootstrap confidence intervals but they
are more complicated and we will not discuss them here.

8.5 Example (The Plasma Cholesterol Data). Let us return to the cholesterol
data. Suppose we are interested in the difference of the medians. R code
for the bootstrap analysis is as follows:

    # x1 holds the first sample and x2 the second sample
    n1 <- length(x1)
    n2 <- length(x2)
    th.hat <- median(x2) - median(x1)
    B <- 1000
    Tboot <- numeric(B)      # vector of length B
    for (i in 1:B) {
      xx1 <- sample(x1, size = n1, replace = TRUE)
      xx2 <- sample(x2, size = n2, replace = TRUE)
      Tboot[i] <- median(xx2) - median(xx1)
    }
    se <- sqrt(var(Tboot))
    normal <- c(th.hat - 2 * se, th.hat + 2 * se)
    percentile <- c(quantile(Tboot, .025), quantile(Tboot, .975))
    pivotal <- c(2 * th.hat - quantile(Tboot, .975),
                 2 * th.hat - quantile(Tboot, .025))

The point estimate is 18.5, the bootstrap standard error is 7.42 and the
resulting approximate 95 percent confidence intervals are as follows:

Method        95% Interval
Normal        (3.7, 33.3)
Pivotal       (5.0, 34.0)
Percentile    (5.0, 33.3)

Since  these  intervals  exclude  0,  it  appears  that  the  second  group  has  higher 
cholesterol  although  there  is  considerable  uncertainty  about  how  much  higher 
as  reflected  in  the  width  of  the  intervals.  ■ 

The next two examples are based on small sample sizes. In practice,
statistical methods based on very small sample sizes might not be reliable. We
include the examples for their pedagogical value but we do want to sound a
note of caution: the results should be interpreted with some skepticism.

8.6 Example. Here is an example that was one of the first used to illustrate
the bootstrap by Bradley Efron, the inventor of the bootstrap. The data are
LSAT scores (for entrance to law school) and GPA.

LSAT   576    635    558    578    666    580    555    661
GPA    3.39   3.30   2.81   3.03   3.44   3.07   3.00   3.43

LSAT   651    605    653    575    545    572    594
GPA    3.36   3.13   3.12   2.74   2.76   2.88   2.96

Each data point is of the form $X_i = (Y_i, Z_i)$ where $Y_i = \text{LSAT}_i$ and
$Z_i = \text{GPA}_i$. The law school is interested in the correlation

$$\theta = \frac{\int\!\!\int (y - \mu_Y)(z - \mu_Z)\, dF(y, z)}{\sqrt{\int (y - \mu_Y)^2\, dF(y) \int (z - \mu_Z)^2\, dF(z)}}.$$

The plug-in estimate is the sample correlation

$$\hat{\theta} = \frac{\sum_i (Y_i - \overline{Y})(Z_i - \overline{Z})}{\sqrt{\sum_i (Y_i - \overline{Y})^2 \sum_i (Z_i - \overline{Z})^2}}.$$

The estimated correlation is $\hat{\theta} = .776$. The bootstrap based on $B = 1000$
gives $\widehat{\text{se}} = .137$. Figure 8.1 shows the data and a histogram of the bootstrap
replications $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$. This histogram is an approximation to the sampling
distribution of $\hat{\theta}$. The Normal-based 95 percent confidence interval is
$.78 \pm 2\,\widehat{\text{se}} = (.51, 1.00)$ while the percentile interval is (.46, .96). In large samples, the
two methods will show closer agreement. ■
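The analysis in this example can be sketched in Python as follows. The data are typed in from the table above, with the last GPA value taken as 2.96, which is the value that reproduces the reported correlation of .776. The bootstrap results depend on the random seed, so they will be close to, but not exactly equal to, the numbers quoted in the text. The key point is that each bootstrap sample resamples (LSAT, GPA) pairs, not the two columns separately.

```python
import numpy as np

lsat = np.array([576, 635, 558, 578, 666, 580, 555, 661,
                 651, 605, 653, 575, 545, 572, 594], dtype=float)
gpa = np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43,
                3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96])  # last value assumed to be 2.96

def corr(y, z):
    """Sample correlation, i.e. the plug-in estimate of theta."""
    yc, zc = y - y.mean(), z - z.mean()
    return np.sum(yc * zc) / np.sqrt(np.sum(yc ** 2) * np.sum(zc ** 2))

theta_hat = corr(lsat, gpa)

rng = np.random.default_rng(0)
B, n = 1000, lsat.size
t_boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # resample students, keeping pairs intact
    t_boot[b] = corr(lsat[idx], gpa[idx])

se_boot = t_boot.std()
percentile = np.quantile(t_boot, [0.025, 0.975])
print(round(float(theta_hat), 3), round(float(se_boot), 3), np.round(percentile, 2))
```

With only 15 observations the bootstrap distribution of the correlation is noticeably skewed toward lower values, which is why the Normal-based and percentile intervals disagree.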


8.7 Example. This example is from Efron and Tibshirani (1993). When drug
companies introduce new medications, they are sometimes required to show
bioequivalence. This means that the new drug is not substantially different
from the current treatment. Here are data on eight subjects who used
medical patches to infuse a hormone into the blood. Each subject received three
treatments: placebo, old-patch, new-patch.

subject   placebo   old      new      old - placebo   new - old
1         9243      17649    16449    8406            -1200
2         9671      12013    14614    2342            2601
3         11792     19979    17274    8187            -2705
4         13357     21816    23798    8459            1982
5         9055      13850    12560    4795            -1290
6         6290      9806     10157    3516            351
7         12412     17208    16570    4796            -638
8         18806     29044    26325    10238           -2719

FIGURE 8.1. Law school data. The top panel shows the raw data. The bottom panel
is a histogram of the correlations computed from each bootstrap sample.

Let $Z = \text{old} - \text{placebo}$ and $Y = \text{new} - \text{old}$. The Food and Drug
Administration (FDA) requirement for bioequivalence is that $|\theta| \le .20$ where

$$\theta = \frac{\mathbb{E}_F(Y)}{\mathbb{E}_F(Z)}.$$

The plug-in estimate of $\theta$ is

$$\hat{\theta} = \frac{\overline{Y}}{\overline{Z}} = \frac{-452.3}{6342} = -0.0713.$$

The bootstrap standard error is $\widehat{\text{se}} = 0.105$. To answer the bioequivalence
question, we compute a confidence interval. From $B = 1000$ bootstrap
replications we get the 95 percent interval (-0.24, 0.15). This is not quite contained
in (-0.20, 0.20) so at the 95 percent level we have not demonstrated
bioequivalence. Figure 8.2 shows the histogram of the bootstrap values. ■
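A Python sketch of this calculation is given below, using the eight rows of the table. The text does not say which of the three interval methods produced (-0.24, 0.15); the percentile interval is used here as an assumption, and the bootstrap output will vary somewhat with the random seed.

```python
import numpy as np

placebo = np.array([9243, 9671, 11792, 13357, 9055, 6290, 12412, 18806], dtype=float)
old = np.array([17649, 12013, 19979, 21816, 13850, 9806, 17208, 29044], dtype=float)
new = np.array([16449, 14614, 17274, 23798, 12560, 10157, 16570, 26325], dtype=float)

z = old - placebo
y = new - old
theta_hat = y.mean() / z.mean()               # plug-in estimate of E(Y) / E(Z)

rng = np.random.default_rng(0)
B, n = 1000, y.size
idx = rng.integers(0, n, size=(B, n))         # resample subjects, keeping (Z, Y) paired
t_boot = y[idx].mean(axis=1) / z[idx].mean(axis=1)

se_boot = t_boot.std()
interval = np.quantile(t_boot, [0.025, 0.975])   # percentile interval (an assumed choice)
inside = bool(-0.20 < interval[0] and interval[1] < 0.20)

print(round(float(theta_hat), 4), round(float(se_boot), 3), np.round(interval, 2), inside)
```

Bioequivalence is supported only if the whole interval lies inside (-0.20, 0.20), which is the reason a confidence interval, and not just the point estimate, is needed here.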


FIGURE 8.2. Patch data. [Histogram of the bootstrap values; horizontal axis
from -0.3 to 0.2, labeled "Bootstrap Samples".]


8.4  Bibliographic  Remarks 

The  bootstrap  was  invented  by  Efron  (1979).  There  are  several  books  on  these 
topics  including  Efron  and  Tibshirani  (1993),  Davison  and  Hinkley  (1997), 
Hall  (1992)  and  Shao  and  Tu  (1995).  Also,  see  section  3.6  of  van  der  Vaart 
and  Wellner  (1996). 


8.5  Appendix 

8.5.1 The Jackknife

There is another method for computing standard errors called the jackknife,
due to Quenouille (1949). It is less computationally expensive than the
bootstrap but is less general. Let $T_n = T(X_1, \ldots, X_n)$ be a statistic and $T_{(-i)}$
denote the statistic with the $i$th observation removed. Let $\overline{T}_n = n^{-1} \sum_{i=1}^n T_{(-i)}$.
The jackknife estimate of $\text{var}(T_n)$ is

$$v_{jack} = \frac{n-1}{n} \sum_{i=1}^n \left(T_{(-i)} - \overline{T}_n\right)^2$$

and the jackknife estimate of the standard error is $\widehat{\text{se}}_{jack} = \sqrt{v_{jack}}$. Under
suitable conditions on $T$, it can be shown that $v_{jack}$ consistently estimates
$\text{var}(T_n)$ in the sense that $v_{jack}/\text{var}(T_n) \xrightarrow{P} 1$. However, unlike the bootstrap,
the jackknife does not produce consistent estimates of the standard error of
sample quantiles.
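The jackknife needs only $n$ recomputations of the statistic. The Python sketch below is one way to write it; the function name `jackknife_variance` and the small data set are assumptions for illustration. For the sample mean the jackknife variance reduces algebraically to $S_n^2/n$, which gives a convenient check.

```python
import numpy as np

def jackknife_variance(x, stat):
    """Jackknife estimate of var(T_n) for the statistic T_n = stat(x)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # T_(-i): the statistic recomputed with the i-th observation left out
    t_loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    t_bar = t_loo.mean()
    return (n - 1) / n * np.sum((t_loo - t_bar) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])      # small made-up data set
v_jack = jackknife_variance(x, np.mean)
se_jack = np.sqrt(v_jack)
print(round(float(v_jack), 4), round(float(x.var(ddof=1) / x.size), 4))
```

The two printed values agree, since for the mean each leave-one-out deviation equals $(\overline{X}_n - X_i)/(n-1)$.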


8.5.2  Justification  For  The  Percentile  Interval 

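Suppose there exists a monotone transformation $U = m(T)$ such that
$U \sim N(\phi, c^2)$ where $\phi = m(\theta)$. We do not suppose we know the transformation,
only that one exists. Let $U_b^* = m(\theta_{n,b}^*)$. Let $u_\beta^*$ be the $\beta$ sample quantile of
the $U_b^*$'s. Since a monotone transformation preserves quantiles, we have that
$u_{\alpha/2}^* = m(\theta_{\alpha/2}^*)$. Also, since $U \sim N(\phi, c^2)$, the $\alpha/2$ quantile of $U$ is $\phi - z_{\alpha/2} c$.
Hence $u_{\alpha/2}^* = \phi - z_{\alpha/2} c$. Similarly, $u_{1-\alpha/2}^* = \phi + z_{\alpha/2} c$. Therefore,

$$\begin{aligned}
\mathbb{P}(\theta_{\alpha/2}^* \le \theta \le \theta_{1-\alpha/2}^*) &= \mathbb{P}(m(\theta_{\alpha/2}^*) \le m(\theta) \le m(\theta_{1-\alpha/2}^*)) \\
&= \mathbb{P}(u_{\alpha/2}^* \le \phi \le u_{1-\alpha/2}^*) \\
&= \mathbb{P}(U - c z_{\alpha/2} \le \phi \le U + c z_{\alpha/2}) \\
&= \mathbb{P}\left(-z_{\alpha/2} \le \frac{U - \phi}{c} \le z_{\alpha/2}\right) \\
&= 1 - \alpha.
\end{aligned}$$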


An  exact  normalizing  transformation  will  rarely  exist  but  there  may  exist 
approximate  normalizing  transformations. 


8.6  Exercises 

1.  Consider  the  data  in  Example  8.6.  Find  the  plug-in  estimate  of  the 
correlation  coefficient.  Estimate  the  standard  error  using  the  bootstrap. 
Find  a  95  percent  confidence  interval  using  the  Normal,  pivotal,  and 
percentile  methods. 


2. (Computer Experiment.) Conduct a simulation to compare the various
bootstrap confidence interval methods. Let $n = 50$ and let
$T(F) = \int (x - \mu)^3\, dF(x)/\sigma^3$ be the skewness. Draw $Y_1, \ldots, Y_n \sim N(0, 1)$ and
set $X_i = e^{Y_i}$, $i = 1, \ldots, n$. Construct the three types of bootstrap 95
percent intervals for $T(F)$ from the data $X_1, \ldots, X_n$. Repeat this whole
thing many times and estimate the true coverage of the three intervals.

3. Let

$$X_1, \ldots, X_n \sim t_3$$

where $n = 25$. Let $\theta = T(F) = (q_{.75} - q_{.25})/1.34$ where $q_p$ denotes the
$p$th quantile. Do a simulation to compare the coverage and length of the
following confidence intervals for $\theta$: (i) Normal interval with standard
error from the bootstrap, (ii) bootstrap percentile interval, and (iii)
pivotal bootstrap interval.

4. Let $X_1, \ldots, X_n$ be distinct observations (no ties). Show that there are

$$\binom{2n-1}{n}$$

distinct bootstrap samples.

Hint: Imagine putting $n$ balls into $n$ buckets.

5. Let $X_1, \ldots, X_n$ be distinct observations (no ties). Let $X_1^*, \ldots, X_n^*$ denote
a bootstrap sample and let $\overline{X}_n^* = n^{-1} \sum_{i=1}^n X_i^*$. Find:
$\mathbb{E}(\overline{X}_n^* \mid X_1, \ldots, X_n)$, $\mathbb{V}(\overline{X}_n^* \mid X_1, \ldots, X_n)$, $\mathbb{E}(\overline{X}_n^*)$ and $\mathbb{V}(\overline{X}_n^*)$.

6. (Computer Experiment.) Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$. Let $\theta = e^\mu$ and let
$\hat{\theta} = e^{\overline{X}}$. Create a data set (using $\mu = 5$) consisting of $n = 100$
observations.

(a) Use the bootstrap to get the se and 95 percent confidence interval
for $\theta$.

(b) Plot a histogram of the bootstrap replications. This is an estimate
of the distribution of $\hat{\theta}$. Compare this to the true sampling distribution
of $\hat{\theta}$.

7. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Let $\hat{\theta} = X_{max} = \max\{X_1, \ldots, X_n\}$.
Generate a data set of size 50 with $\theta = 1$.

(a) Find the distribution of $\hat{\theta}$. Compare the true distribution of $\hat{\theta}$ to the
histograms from the bootstrap.

(b) This is a case where the bootstrap does very poorly. In fact, we can
prove that this is the case. Show that $\mathbb{P}(\hat{\theta} = \theta) = 0$ and yet
$\mathbb{P}(\hat{\theta}^* = \hat{\theta}) \approx .632$. Hint: show that $\mathbb{P}(\hat{\theta}^* = \hat{\theta}) = 1 - (1 - (1/n))^n$, then take the
limit as $n$ gets large.

8.  Let  Tn  =  n  =  E(Xi),  =  f  \x—ju\kdF(x)  and  =  n-1  J2i= i  \Xi~ 
Xn  k .  Show  that 


^boot 


9 

Parametric  Inference 


We now turn our attention to parametric models, that is, models of the form

$$\mathfrak{F} = \Big\{ f(x; \theta) : \ \theta \in \Theta \Big\} \tag{9.1}$$

where $\Theta \subset \mathbb{R}^k$ is the parameter space and $\theta = (\theta_1, \ldots, \theta_k)$ is the
parameter. The problem of inference then reduces to the problem of estimating the
parameter $\theta$.

Students  learning  statistics  often  ask:  how  would  we  ever  know  that  the 
distribution  that  generated  the  data  is  in  some  parametric  model?  This  is 
an  excellent  question.  Indeed,  we  would  rarely  have  such  knowledge  which 
is  why  nonparametric  methods  are  preferable.  Still,  studying  methods  for 
parametric  models  is  useful  for  two  reasons.  First,  there  are  some  cases  where 
background  knowledge  suggests  that  a  parametric  model  provides  a  reasonable 
approximation.  For  example,  counts  of  traffic  accidents  are  known  from  prior 
experience  to  follow  approximately  a  Poisson  model.  Second,  the  inferential 
concepts  for  parametric  models  provide  background  for  understanding  certain 
nonparametric  methods. 

We begin with a brief discussion about parameters of interest and nuisance
parameters in the next section, then we will discuss two methods for
estimating $\theta$, the method of moments and the method of maximum likelihood.

9.1 Parameter of Interest

Often, we are only interested in some function $T(\theta)$. For example, if
$X \sim N(\mu, \sigma^2)$ then the parameter is $\theta = (\mu, \sigma)$. If our goal is to estimate $\mu$ then
$\mu = T(\theta)$ is called the parameter of interest and $\sigma$ is called a nuisance
parameter. The parameter of interest might be a complicated function of $\theta$
as in the following example.

9.1 Example. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, \sigma^2)$. The parameter is $\theta = (\mu, \sigma)$
and the parameter space is $\Theta = \{(\mu, \sigma) : \mu \in \mathbb{R},\ \sigma > 0\}$. Suppose that $X_i$ is
the outcome of a blood test and suppose we are interested in $\tau$, the fraction
of the population whose test score is larger than 1. Let $Z$ denote a standard
Normal random variable. Then

$$\begin{aligned}
\tau &= \mathbb{P}(X > 1) = 1 - \mathbb{P}(X < 1) = 1 - \mathbb{P}\left(\frac{X - \mu}{\sigma} < \frac{1 - \mu}{\sigma}\right) \\
&= 1 - \mathbb{P}\left(Z < \frac{1 - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{1 - \mu}{\sigma}\right).
\end{aligned}$$

The parameter of interest is $\tau = T(\mu, \sigma) = 1 - \Phi((1 - \mu)/\sigma)$. ■
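The parameter of interest in this example is just a function of $(\mu, \sigma)$, which the following Python sketch evaluates using the standard library's `statistics.NormalDist` for $\Phi$. The function name `fraction_above_one` and the parameter values tried are assumptions for illustration.

```python
from statistics import NormalDist

def fraction_above_one(mu, sigma):
    """tau = T(mu, sigma) = 1 - Phi((1 - mu) / sigma)."""
    return 1.0 - NormalDist().cdf((1.0 - mu) / sigma)

tau_centered = fraction_above_one(mu=1.0, sigma=2.0)   # threshold at the mean
tau_standard = fraction_above_one(mu=0.0, sigma=1.0)   # P(Z > 1) for a standard Normal
print(tau_centered, round(tau_standard, 4))
```

When $\mu = 1$ the threshold sits at the center of the distribution, so $\tau = 1/2$ whatever the value of $\sigma$; otherwise $\tau$ depends on both components of $\theta$, which is why $\sigma$ cannot simply be ignored here.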

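9.2 Example. Recall that $X$ has a $\text{Gamma}(\alpha, \beta)$ distribution if

$$f(x; \alpha, \beta) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}, \quad x > 0$$

where $\alpha, \beta > 0$ and

$$\Gamma(\alpha) = \int_0^\infty y^{\alpha - 1} e^{-y}\, dy$$

is the Gamma function. The parameter is $\theta = (\alpha, \beta)$. The Gamma
distribution is sometimes used to model lifetimes of people, animals, and
electronic equipment. Suppose we want to estimate the mean lifetime. Then
$T(\alpha, \beta) = \mathbb{E}_\theta(X_1) = \alpha\beta$. ■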

9.2  The  Method  of  Moments 

The  first  method  for  generating  parametric  estimators  that  we  will  study 
is  called  the  method  of  moments.  We  will  see  that  these  estimators  are  not 
optimal  but  they  are  often  easy  to  compute.  They  are  are  also  useful  as  starting 
values  for  other  methods  that  require  iterative  numerical  routines. 


Suppose that the parameter $\theta = (\theta_1, \ldots, \theta_k)$ has $k$ components. For $1 \le j \le k$, define the $j$th moment

$$\alpha_j \equiv \alpha_j(\theta) = E_\theta(X^j) = \int x^j \, dF_\theta(x) \qquad (9.2)$$

and the $j$th sample moment

$$\hat\alpha_j = \frac{1}{n} \sum_{i=1}^n X_i^j. \qquad (9.3)$$


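9.3 Definition. The method of moments estimator $\hat\theta_n$ is defined to be the value of $\theta$ such that

$$\alpha_1(\hat\theta_n) = \hat\alpha_1, \quad \alpha_2(\hat\theta_n) = \hat\alpha_2, \quad \ldots, \quad \alpha_k(\hat\theta_n) = \hat\alpha_k. \qquad (9.4)$$

Formula (9.4) defines a system of $k$ equations with $k$ unknowns.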


9.4 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then $\alpha_1 = E_p(X) = p$ and $\hat\alpha_1 = n^{-1} \sum_{i=1}^n X_i$. By equating these we get the estimator

$$\hat p_n = \frac{1}{n} \sum_{i=1}^n X_i. \qquad ■$$


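9.5 Example. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, \sigma^2)$. Then $\alpha_1 = E_\theta(X_1) = \mu$ and $\alpha_2 = E_\theta(X_1^2) = V_\theta(X_1) + (E_\theta(X_1))^2 = \sigma^2 + \mu^2$. We need to solve the equations¹

$$\hat\mu = \frac{1}{n} \sum_{i=1}^n X_i, \qquad \hat\sigma^2 + \hat\mu^2 = \frac{1}{n} \sum_{i=1}^n X_i^2.$$

This is a system of 2 equations with 2 unknowns. The solution is

$$\hat\mu = \bar X_n, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2. \qquad ■$$

¹Recall that $V(X) = E(X^2) - (E(X))^2$. Hence, $E(X^2) = V(X) + (E(X))^2$.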
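The two moment equations of the Normal example can be solved directly in code. The sketch below is only an illustration: the function name and the small data set are made up, and it returns $\hat\sigma^2$ as the second sample moment minus the squared first sample moment.

```python
def normal_method_of_moments(xs):
    """Solve mu = m1 and sigma^2 + mu^2 = m2 for a Normal(mu, sigma^2) model."""
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    mu_hat = m1
    sigma2_hat = m2 - m1 * m1         # equals (1/n) * sum((x - mean)^2)
    return mu_hat, sigma2_hat


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma2_hat = normal_method_of_moments(data)
```

For these eight values the first sample moment is 5 and the second is 29, so the estimates are $\hat\mu = 5$ and $\hat\sigma^2 = 4$.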


9.6 Theorem. Let $\hat\theta_n$ denote the method of moments estimator. Under appropriate conditions on the model, the following statements hold:

1. The estimate $\hat\theta_n$ exists with probability tending to 1.

2. The estimate is consistent: $\hat\theta_n \xrightarrow{P} \theta$.

3. The estimate is asymptotically Normal:

$$\sqrt{n}\,(\hat\theta_n - \theta) \rightsquigarrow N(0, \Sigma)$$

where

$$\Sigma = g\, E_\theta(Y Y^T)\, g^T,$$

$Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \partial \alpha_j^{-1}(\theta) / \partial\theta$.

The last statement in the theorem above can be used to find standard errors and confidence intervals. However, there is an easier way: the bootstrap. We defer discussion of this until the end of the chapter.


9.3  Maximum  Likelihood 

The most common method for estimating parameters in a parametric model is the maximum likelihood method. Let $X_1, \ldots, X_n$ be IID with PDF $f(x; \theta)$.

9.7 Definition. The likelihood function is defined by

$$\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta). \qquad (9.5)$$

The log-likelihood function is defined by $\ell_n(\theta) = \log \mathcal{L}_n(\theta)$.

The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $\mathcal{L}_n : \Theta \to [0, \infty)$. The likelihood function is not a density function: in general, it is not true that $\mathcal{L}_n(\theta)$ integrates to 1 (with respect to $\theta$).

9.8 Definition. The maximum likelihood estimator (MLE), denoted by $\hat\theta_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.


FIGURE 9.1. Likelihood function for Bernoulli with $n = 20$ and $\sum_{i=1}^n X_i = 12$. The MLE is $\hat p_n = 12/20 = 0.6$.

The maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of $\mathcal{L}_n(\theta)$, so maximizing the log-likelihood leads to the same answer as maximizing the likelihood. Often, it is easier to work with the log-likelihood.

9.9 Remark. If we multiply $\mathcal{L}_n(\theta)$ by any positive constant $c$ (not depending on $\theta$) then this will not change the MLE. Hence, we shall often drop constants in the likelihood function.

9.10 Example. Suppose that $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. The probability function is $f(x; p) = p^x (1 - p)^{1 - x}$ for $x = 0, 1$. The unknown parameter is $p$. Then,

$$\mathcal{L}_n(p) = \prod_{i=1}^n f(X_i; p) = \prod_{i=1}^n p^{X_i} (1 - p)^{1 - X_i} = p^S (1 - p)^{n - S}$$

where $S = \sum_i X_i$. Hence,

$$\ell_n(p) = S \log p + (n - S) \log(1 - p).$$

Take the derivative of $\ell_n(p)$ and set it equal to 0 to find that the MLE is $\hat p_n = S/n$. See Figure 9.1. ■
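As a numerical check on the calculus, the following sketch evaluates the Bernoulli log-likelihood on a grid and picks the maximizer. The grid resolution and function name are arbitrary choices; the values $n = 20$, $S = 12$ are those of Figure 9.1.

```python
import math


def bernoulli_log_likelihood(p: float, s: int, n: int) -> float:
    """l_n(p) = S log p + (n - S) log(1 - p), valid for 0 < p < 1."""
    return s * math.log(p) + (n - s) * math.log(1.0 - p)


n, s = 20, 12
# Grid of candidate values strictly inside (0, 1) to avoid log(0).
grid = [k / 1000 for k in range(1, 1000)]
p_best = max(grid, key=lambda p: bernoulli_log_likelihood(p, s, n))
```

The grid maximizer `p_best` should agree with the closed-form MLE $S/n$ up to the grid spacing.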

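9.11 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. The parameter is $\theta = (\mu, \sigma)$ and the likelihood function (ignoring some constants) is:

$$\mathcal{L}_n(\mu, \sigma) = \prod_i \frac{1}{\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (X_i - \mu)^2 \right\} = \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i (X_i - \mu)^2 \right\} = \sigma^{-n} \exp\left\{ -\frac{n S^2}{2\sigma^2} \right\} \exp\left\{ -\frac{n (\bar X - \mu)^2}{2\sigma^2} \right\}$$

where $\bar X = n^{-1} \sum_i X_i$ is the sample mean and $S^2 = n^{-1} \sum_i (X_i - \bar X)^2$. The last equality above follows from the fact that $\sum_i (X_i - \mu)^2 = n S^2 + n (\bar X - \mu)^2$, which can be verified by writing $\sum_i (X_i - \mu)^2 = \sum_i (X_i - \bar X + \bar X - \mu)^2$ and then expanding the square. The log-likelihood is

$$\ell(\mu, \sigma) = -n \log \sigma - \frac{n S^2}{2\sigma^2} - \frac{n (\bar X - \mu)^2}{2\sigma^2}.$$

Solving the equations

$$\frac{\partial \ell(\mu, \sigma)}{\partial \mu} = 0 \quad \text{and} \quad \frac{\partial \ell(\mu, \sigma)}{\partial \sigma} = 0,$$

we conclude that $\hat\mu = \bar X$ and $\hat\sigma = S$. It can be verified that these are indeed global maxima of the likelihood. ■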

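9.12 Example (A Hard Example). Here is an example that many people find confusing. Let $X_1, \ldots, X_n \sim \text{Unif}(0, \theta)$. Recall that

$$f(x; \theta) = \begin{cases} 1/\theta & 0 \le x \le \theta \\ 0 & \text{otherwise.} \end{cases}$$

Consider a fixed value of $\theta$. Suppose $\theta < X_i$ for some $i$. Then, $f(X_i; \theta) = 0$ and hence $\mathcal{L}_n(\theta) = \prod_i f(X_i; \theta) = 0$. It follows that $\mathcal{L}_n(\theta) = 0$ if any $X_i > \theta$. Therefore, $\mathcal{L}_n(\theta) = 0$ if $\theta < X_{(n)}$ where $X_{(n)} = \max\{X_1, \ldots, X_n\}$. Now consider any $\theta \ge X_{(n)}$. For every $X_i$ we then have that $f(X_i; \theta) = 1/\theta$ so that $\mathcal{L}_n(\theta) = \prod_i f(X_i; \theta) = \theta^{-n}$. In conclusion,

$$\mathcal{L}_n(\theta) = \begin{cases} \left(\frac{1}{\theta}\right)^n & \theta \ge X_{(n)} \\ 0 & \theta < X_{(n)}. \end{cases}$$

See Figure 9.2. Now $\mathcal{L}_n(\theta)$ is strictly decreasing over the interval $[X_{(n)}, \infty)$. Hence, $\hat\theta_n = X_{(n)}$. ■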
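The piecewise form of the Uniform likelihood is easy to mis-read, so a small sketch may help. The data values and the function name below are invented for illustration; the code simply evaluates the likelihood on either side of the sample maximum.

```python
def uniform_likelihood(theta: float, xs) -> float:
    """L_n(theta) for Unif(0, theta): theta^(-n) if theta >= max(xs), else 0."""
    if theta <= 0 or theta < max(xs):
        return 0.0
    return theta ** (-len(xs))


# Hypothetical non-negative observations.
data = [0.8, 2.1, 1.4, 3.0, 0.3]
theta_hat = max(data)  # the MLE is the largest observation

below = uniform_likelihood(2.9, data)      # theta below the maximum: likelihood is 0
at_max = uniform_likelihood(3.0, data)     # largest value of the likelihood
above = uniform_likelihood(3.1, data)      # positive, but smaller than at_max
```

The likelihood jumps from 0 to its maximum at $\theta = X_{(n)}$ and then decreases, which is why setting a derivative to zero does not work here.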


The  maximum  likelihood  estimators  for  the  multivariate  Normal  and  the 
multinomial  can  be  found  in  Theorems  14.5  and  14.3. 


9.4  Properties  of  Maximum  Likelihood  Estimators 

Under  certain  conditions  on  the  model,  the  maximum  likelihood  estimator  0n 
possesses  many  properties  that  make  it  an  appealing  choice  of  estimator.  The 
main  properties  of  the  MLE  are: 


1. The MLE is consistent: $\hat\theta_n \xrightarrow{P} \theta_*$ where $\theta_*$ denotes the true value of the parameter $\theta$;

2. The MLE is equivariant: if $\hat\theta_n$ is the MLE of $\theta$ then $g(\hat\theta_n)$ is the MLE of $g(\theta)$;

3. The MLE is asymptotically Normal: $(\hat\theta_n - \theta_*)/\widehat{\mathrm{se}} \rightsquigarrow N(0, 1)$; also, the estimated standard error $\widehat{\mathrm{se}}$ can often be computed analytically;

4. The MLE is asymptotically optimal or efficient: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance, at least for large samples;

5. The MLE is approximately the Bayes estimator. (This point will be explained later.)

FIGURE 9.2. Likelihood function for Uniform$(0, \theta)$. The vertical lines show the observed data. The first three plots show $f(x; \theta)$ for three different values of $\theta$. When $\theta < X_{(n)} = \max\{X_1, \ldots, X_n\}$, as in the first plot, $f(X_{(n)}; \theta) = 0$ and hence $\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta) = 0$. Otherwise $f(X_i; \theta) = 1/\theta$ for each $i$ and hence $\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta) = (1/\theta)^n$. The last plot shows the likelihood function.

We will spend some time explaining what these properties mean and why they are good things. In sufficiently complicated problems, these properties will no longer hold and the MLE will no longer be a good estimator. For now we focus on the simpler situations where the MLE works well. The properties we discuss only hold if the model satisfies certain regularity conditions. These are essentially smoothness conditions on $f(x; \theta)$. Unless otherwise stated, we shall tacitly assume that these conditions hold.


9.5  Consistency  of  Maximum  Likelihood  Estimators 


Consistency means that the MLE converges in probability to the true value. To proceed, we need a definition. If $f$ and $g$ are PDFs, define the Kullback-Leibler distance² between $f$ and $g$ to be

$$D(f, g) = \int f(x) \log\left( \frac{f(x)}{g(x)} \right) dx.$$

It can be shown that $D(f, g) \ge 0$ and $D(f, f) = 0$. For any $\theta, \psi \in \Theta$ write $D(\theta, \psi)$ to mean $D(f(x; \theta), f(x; \psi))$.

We will say that the model $\mathfrak{F}$ is identifiable if $\theta \ne \psi$ implies that $D(\theta, \psi) > 0$. This means that different values of the parameter correspond to different distributions. We will assume from now on that the model is identifiable.

²This is not a distance in the formal sense because $D(f, g)$ is not symmetric.
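A small numerical illustration of the asymmetry, assuming two Bernoulli distributions so that the integral in $D(f, g)$ becomes a two-term sum (the discrete analogue of the definition). The function name and the chosen probabilities are hypothetical.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """D(f, g) for f = Bernoulli(p) and g = Bernoulli(q), with 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


d_pq = kl_bernoulli(0.5, 0.9)   # D(f, g)
d_qp = kl_bernoulli(0.9, 0.5)   # D(g, f): a different number
d_pp = kl_bernoulli(0.3, 0.3)   # D(f, f) = 0
```

Both `d_pq` and `d_qp` are positive but unequal, which is exactly why $D$ fails to be a distance in the formal sense.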


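Let $\theta_*$ denote the true value of $\theta$. Maximizing $\ell_n(\theta)$ is equivalent to maximizing

$$M_n(\theta) = \frac{1}{n} \sum_i \log \frac{f(X_i; \theta)}{f(X_i; \theta_*)}.$$

This follows since $M_n(\theta) = n^{-1}(\ell_n(\theta) - \ell_n(\theta_*))$ and $\ell_n(\theta_*)$ is a constant (with respect to $\theta$). By the law of large numbers, $M_n(\theta)$ converges to

$$E_{\theta_*}\left( \log \frac{f(X_i; \theta)}{f(X_i; \theta_*)} \right) = \int \log\left( \frac{f(x; \theta)}{f(x; \theta_*)} \right) f(x; \theta_*)\, dx = -\int \log\left( \frac{f(x; \theta_*)}{f(x; \theta)} \right) f(x; \theta_*)\, dx = -D(\theta_*, \theta).$$

Hence, $M_n(\theta) \approx -D(\theta_*, \theta)$, which is maximized at $\theta_*$ since $-D(\theta_*, \theta_*) = 0$ and $-D(\theta_*, \theta) < 0$ for $\theta \ne \theta_*$. Therefore, we expect that the maximizer will tend to $\theta_*$. To prove this formally, we need more than $M_n(\theta) \xrightarrow{P} -D(\theta_*, \theta)$. We need this convergence to be uniform over $\theta$. We also have to make sure that the function $D(\theta_*, \theta)$ is well behaved. Here are the formal details.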


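9.13 Theorem. Let $\theta_*$ denote the true value of $\theta$. Define

$$M_n(\theta) = \frac{1}{n} \sum_i \log \frac{f(X_i; \theta)}{f(X_i; \theta_*)}$$

and $M(\theta) = -D(\theta_*, \theta)$. Suppose that

$$\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{P} 0 \qquad (9.7)$$

and that, for every $\epsilon > 0$,

$$\sup_{\theta : |\theta - \theta_*| \ge \epsilon} M(\theta) < M(\theta_*). \qquad (9.8)$$

Let $\hat\theta_n$ denote the MLE. Then $\hat\theta_n \xrightarrow{P} \theta_*$.

The proof is in the appendix.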


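9.6 Equivariance of the MLE

9.14 Theorem. Let $\tau = g(\theta)$ be a function of $\theta$. Let $\hat\theta_n$ be the MLE of $\theta$. Then $\hat\tau_n = g(\hat\theta_n)$ is the MLE of $\tau$.

Proof. Let $h = g^{-1}$ denote the inverse of $g$. Then $\hat\theta_n = h(\hat\tau_n)$. For any $\tau$, $\mathcal{L}(\tau) = \prod_i f(x_i; h(\tau)) = \prod_i f(x_i; \theta) = \mathcal{L}(\theta)$ where $\theta = h(\tau)$. Hence, for any $\tau$, $\mathcal{L}_n(\tau) = \mathcal{L}(\theta) \le \mathcal{L}(\hat\theta) = \mathcal{L}_n(\hat\tau)$. ■

9.15 Example. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. The MLE for $\theta$ is $\hat\theta_n = \bar X_n$. Let $\tau = e^\theta$. Then, the MLE for $\tau$ is $\hat\tau = e^{\hat\theta} = e^{\bar X}$. ■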


9.7  Asymptotic  Normality 

It turns out that the distribution of $\hat\theta_n$ is approximately Normal and we can compute its approximate variance analytically. To explore this, we first need a few definitions.

9.16 Definition. The score function is defined to be

$$s(X; \theta) = \frac{\partial \log f(X; \theta)}{\partial \theta}.$$

The Fisher information is defined to be

$$I_n(\theta) = V_\theta\left( \sum_{i=1}^n s(X_i; \theta) \right) = \sum_{i=1}^n V_\theta\big( s(X_i; \theta) \big). \qquad (9.10)$$

For $n = 1$ we will sometimes write $I(\theta)$ instead of $I_1(\theta)$. It can be shown that $E_\theta(s(X; \theta)) = 0$. It then follows that $V_\theta(s(X; \theta)) = E_\theta(s^2(X; \theta))$. In fact, a further simplification of $I_n(\theta)$ is given in the next result.


9.18 Theorem (Asymptotic Normality of the MLE). Let $\mathrm{se} = \sqrt{V(\hat\theta_n)}$. Under appropriate regularity conditions, the following hold:

1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and

$$\frac{\hat\theta_n - \theta}{\mathrm{se}} \rightsquigarrow N(0, 1). \qquad (9.12)$$

2. Let $\widehat{\mathrm{se}} = \sqrt{1/I_n(\hat\theta_n)}$. Then,

$$\frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1). \qquad (9.13)$$

The proof is in the appendix. The first statement says that $\hat\theta_n \approx N(\theta, \mathrm{se}^2)$ where the approximate standard error of $\hat\theta_n$ is $\mathrm{se} = \sqrt{1/I_n(\theta)}$. The second statement says that this is still true even if we replace the standard error by its estimated standard error $\widehat{\mathrm{se}} = \sqrt{1/I_n(\hat\theta_n)}$.

Informally, the theorem says that the distribution of the MLE can be approximated with $N(\theta, \widehat{\mathrm{se}}^2)$. From this fact we can construct an (asymptotic) confidence interval.


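9.19 Theorem. Let

$$C_n = \left( \hat\theta_n - z_{\alpha/2}\, \widehat{\mathrm{se}},\ \hat\theta_n + z_{\alpha/2}\, \widehat{\mathrm{se}} \right).$$

Then, $P_\theta(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.

Proof. Let $Z$ denote a standard normal random variable. Then,

$$P_\theta(\theta \in C_n) = P_\theta\left( \hat\theta_n - z_{\alpha/2}\, \widehat{\mathrm{se}} < \theta < \hat\theta_n + z_{\alpha/2}\, \widehat{\mathrm{se}} \right) = P_\theta\left( -z_{\alpha/2} < \frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} < z_{\alpha/2} \right) \to P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha. \qquad ■$$

For $\alpha = .05$, $z_{\alpha/2} = 1.96 \approx 2$, so:

$$\hat\theta_n \pm 2\, \widehat{\mathrm{se}} \qquad (9.14)$$

is an approximate 95 percent confidence interval.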
When you read an opinion poll in the newspaper, you often see a statement like: the poll is accurate to within one point, 95 percent of the time. They are simply giving a 95 percent confidence interval of the form $\hat\theta_n \pm 2\, \widehat{\mathrm{se}}$.


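9.20 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. The MLE is $\hat p_n = \sum_i X_i / n$ and $f(x; p) = p^x (1 - p)^{1 - x}$, $\log f(x; p) = x \log p + (1 - x) \log(1 - p)$,

$$s(X; p) = \frac{X}{p} - \frac{1 - X}{1 - p},$$

and

$$-s'(X; p) = \frac{X}{p^2} + \frac{1 - X}{(1 - p)^2}.$$

Thus,

$$I(p) = E_p(-s'(X; p)) = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p(1 - p)}.$$

Hence,

$$\widehat{\mathrm{se}} = \frac{1}{\sqrt{I_n(\hat p_n)}} = \frac{1}{\sqrt{n I(\hat p_n)}} = \left\{ \frac{\hat p_n (1 - \hat p_n)}{n} \right\}^{1/2}.$$

An approximate 95 percent confidence interval is

$$\hat p_n \pm 2 \left\{ \frac{\hat p_n (1 - \hat p_n)}{n} \right\}^{1/2}. \qquad ■$$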

9.21 Example. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known. The score function is $s(X; \theta) = (X - \theta)/\sigma^2$ and $s'(X; \theta) = -1/\sigma^2$ so that $I_1(\theta) = 1/\sigma^2$. The MLE is $\hat\theta_n = \bar X_n$. According to Theorem 9.18, $\bar X_n \approx N(\theta, \sigma^2/n)$. In this case, the Normal approximation is actually exact. ■

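9.22 Example. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. Then $\hat\lambda_n = \bar X_n$ and some calculations show that $I_1(\lambda) = 1/\lambda$, so

$$\widehat{\mathrm{se}} = \frac{1}{\sqrt{n I(\hat\lambda_n)}} = \sqrt{\frac{\hat\lambda_n}{n}}.$$

Therefore, an approximate $1 - \alpha$ confidence interval for $\lambda$ is $\hat\lambda_n \pm z_{\alpha/2} \sqrt{\hat\lambda_n / n}$. ■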


9.8  Optimality 

Suppose that $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. The MLE is $\hat\theta_n = \bar X_n$. Another reasonable estimator of $\theta$ is the sample median $\tilde\theta_n$. The MLE satisfies

$$\sqrt{n}\,(\hat\theta_n - \theta) \rightsquigarrow N(0, \sigma^2).$$

It can be proved that the median satisfies

$$\sqrt{n}\,(\tilde\theta_n - \theta) \rightsquigarrow N\left(0, \sigma^2 \frac{\pi}{2}\right).$$

This means that the median converges to the right value but has a larger variance than the MLE.

More generally, consider two estimators $T_n$ and $U_n$ and suppose that

$$\sqrt{n}\,(T_n - \theta) \rightsquigarrow N(0, t^2),$$

and that

$$\sqrt{n}\,(U_n - \theta) \rightsquigarrow N(0, u^2).$$

We define the asymptotic relative efficiency of $U$ to $T$ by $\mathrm{ARE}(U, T) = t^2/u^2$. In the Normal example, $\mathrm{ARE}(\tilde\theta_n, \hat\theta_n) = 2/\pi = .63$. The interpretation is that if you use the median, you are effectively using only a fraction of the data.
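The value $2/\pi$ can be checked roughly by simulation. The sketch below is only illustrative: the sample size, number of repetitions, and seed are arbitrary, and at a finite $n$ the estimated ratio will only be near $0.63$, not equal to it.

```python
import random
import statistics


def simulate_are(n: int = 51, reps: int = 2000, seed: int = 0) -> float:
    """Estimate Var(sample mean) / Var(sample median) for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # Ratio of Monte Carlo variances approximates t^2 / u^2 with T = mean, U = median.
    return statistics.pvariance(means) / statistics.pvariance(medians)


are_estimate = simulate_are()
```

A ratio near $0.63$ says that the median computed from 100 Normal observations is about as precise as the mean computed from 63.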

9.23 Theorem. If $\hat\theta_n$ is the MLE and $\tilde\theta_n$ is any other estimator then³

$$\mathrm{ARE}(\tilde\theta_n, \hat\theta_n) \le 1.$$

Thus, the MLE has the smallest (asymptotic) variance and we say that the MLE is efficient or asymptotically optimal.

This result is predicated upon the assumed model being correct. If the model is wrong, the MLE may no longer be optimal. We will discuss optimality in more generality when we discuss decision theory in Chapter 12.

³The result is actually more subtle than this but the details are too complicated to consider here.

9.9 The Delta Method

Let $\tau = g(\theta)$ where $g$ is a smooth function. The maximum likelihood estimator of $\tau$ is $\hat\tau = g(\hat\theta)$. Now we address the following question: what is the distribution of $\hat\tau$?

9.24 Theorem (The Delta Method). If $\tau = g(\theta)$ where $g$ is differentiable and $g'(\theta) \ne 0$ then

$$\frac{\hat\tau_n - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \rightsquigarrow N(0, 1) \qquad (9.15)$$

where $\hat\tau_n = g(\hat\theta_n)$ and

$$\widehat{\mathrm{se}}(\hat\tau_n) = |g'(\hat\theta)|\, \widehat{\mathrm{se}}(\hat\theta_n). \qquad (9.16)$$

Hence, if

$$C_n = \left( \hat\tau_n - z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat\tau_n),\ \hat\tau_n + z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat\tau_n) \right) \qquad (9.17)$$

then $P_\theta(\tau \in C_n) \to 1 - \alpha$ as $n \to \infty$.


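9.25 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\psi = g(p) = \log(p/(1 - p))$. The Fisher information function is $I(p) = 1/(p(1 - p))$ so the estimated standard error of the MLE $\hat p_n$ is

$$\widehat{\mathrm{se}} = \sqrt{\frac{\hat p_n (1 - \hat p_n)}{n}}.$$

The MLE of $\psi$ is $\hat\psi = \log(\hat p/(1 - \hat p))$. Since $g'(p) = 1/(p(1 - p))$, according to the delta method

$$\widehat{\mathrm{se}}(\hat\psi_n) = |g'(\hat p_n)|\, \widehat{\mathrm{se}}(\hat p_n) = \frac{1}{\sqrt{n \hat p_n (1 - \hat p_n)}}.$$

An approximate 95 percent confidence interval is

$$\hat\psi_n \pm \frac{2}{\sqrt{n \hat p_n (1 - \hat p_n)}}. \qquad ■$$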
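The log-odds calculation can be sketched in a few lines. The function name is hypothetical and the counts ($S = 12$ successes in $n = 20$ trials) are only illustrative; the code assumes $0 < \hat p_n < 1$ so that the logarithm is defined.

```python
import math


def log_odds_interval(s: int, n: int, z: float = 2.0):
    """Delta-method estimate, standard error and interval for psi = log(p / (1 - p))."""
    p_hat = s / n
    psi_hat = math.log(p_hat / (1.0 - p_hat))
    # |g'(p_hat)| * se(p_hat) simplifies to 1 / sqrt(n * p_hat * (1 - p_hat)).
    se_psi = 1.0 / math.sqrt(n * p_hat * (1.0 - p_hat))
    return psi_hat, se_psi, (psi_hat - z * se_psi, psi_hat + z * se_psi)


psi_hat, se_psi, interval = log_odds_interval(s=12, n=20)
```

Here $\hat\psi = \log 1.5 \approx 0.41$ with standard error about $0.46$, so the interval comfortably includes 0 (even odds).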


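9.26 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Suppose that $\mu$ is known, $\sigma$ is unknown and that we want to estimate $\psi = \log \sigma$. The log-likelihood is $\ell(\sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2$. Differentiate and set equal to 0 and conclude that

$$\hat\sigma_n = \sqrt{\frac{\sum_i (X_i - \mu)^2}{n}}.$$

To get the standard error we need the Fisher information. First,

$$\log f(X; \sigma) = -\log \sigma - \frac{(X - \mu)^2}{2\sigma^2}$$

with second derivative

$$\frac{1}{\sigma^2} - \frac{3 (X - \mu)^2}{\sigma^4},$$

and hence

$$I(\sigma) = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2}.$$

Therefore, $\widehat{\mathrm{se}} = \hat\sigma_n / \sqrt{2n}$. Let $\psi = g(\sigma) = \log \sigma$. Then, $\hat\psi_n = \log \hat\sigma_n$. Since $g' = 1/\sigma$,

$$\widehat{\mathrm{se}}(\hat\psi_n) = \frac{1}{\hat\sigma_n} \frac{\hat\sigma_n}{\sqrt{2n}} = \frac{1}{\sqrt{2n}},$$

and an approximate 95 percent confidence interval is $\hat\psi_n \pm 2/\sqrt{2n}$. ■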


9.10  Multiparameter  Models 


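These ideas can directly be extended to models with several parameters. Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be the MLE. Let $\ell_n = \sum_{i=1}^n \log f(X_i; \theta)$,

$$H_{jj} = \frac{\partial^2 \ell_n}{\partial \theta_j^2} \quad \text{and} \quad H_{jk} = \frac{\partial^2 \ell_n}{\partial \theta_j\, \partial \theta_k}.$$

Define the Fisher Information Matrix by

$$I_n(\theta) = - \begin{bmatrix} E_\theta(H_{11}) & E_\theta(H_{12}) & \cdots & E_\theta(H_{1k}) \\ E_\theta(H_{21}) & E_\theta(H_{22}) & \cdots & E_\theta(H_{2k}) \\ \vdots & \vdots & \ddots & \vdots \\ E_\theta(H_{k1}) & E_\theta(H_{k2}) & \cdots & E_\theta(H_{kk}) \end{bmatrix}. \qquad (9.18)$$

Let $J_n(\theta) = I_n^{-1}(\theta)$ be the inverse of $I_n$.

9.27 Theorem. Under appropriate regularity conditions,

$$(\hat\theta - \theta) \approx N(0, J_n).$$

Also, if $\hat\theta_j$ is the $j$th component of $\hat\theta$, then

$$\frac{\hat\theta_j - \theta_j}{\widehat{\mathrm{se}}_j} \rightsquigarrow N(0, 1) \qquad (9.19)$$

where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ is the $j$th diagonal element of $J_n$. The approximate covariance of $\hat\theta_j$ and $\hat\theta_k$ is $\mathrm{Cov}(\hat\theta_j, \hat\theta_k) \approx J_n(j, k)$.

There is also a multiparameter delta method. Let $\tau = g(\theta_1, \ldots, \theta_k)$ be a function and let

$$\nabla g = \begin{pmatrix} \dfrac{\partial g}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial g}{\partial \theta_k} \end{pmatrix}$$

be the gradient of $g$.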
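9.28 Theorem (Multiparameter delta method). Suppose that $\nabla g$ evaluated at $\hat\theta$ is not 0. Let $\hat\tau = g(\hat\theta)$. Then

$$\frac{\hat\tau - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \rightsquigarrow N(0, 1)$$

where

$$\widehat{\mathrm{se}}(\hat\tau) = \sqrt{(\hat\nabla g)^T \hat J_n (\hat\nabla g)}, \qquad (9.20)$$

$\hat J_n = J_n(\hat\theta_n)$ and $\hat\nabla g$ is $\nabla g$ evaluated at $\theta = \hat\theta$.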


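9.29 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Let $\tau = g(\mu, \sigma) = \sigma/\mu$. In Exercise 8 you will show that

$$I_n(\mu, \sigma) = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{2n}{\sigma^2} \end{bmatrix}.$$

Hence,

$$J_n = I_n^{-1}(\mu, \sigma) = \frac{1}{n} \begin{bmatrix} \sigma^2 & 0 \\ 0 & \dfrac{\sigma^2}{2} \end{bmatrix}.$$

The gradient of $g$ is

$$\nabla g = \begin{pmatrix} -\dfrac{\sigma}{\mu^2} \\ \dfrac{1}{\mu} \end{pmatrix}.$$

Thus,

$$\widehat{\mathrm{se}}(\hat\tau) = \sqrt{(\hat\nabla g)^T \hat J_n (\hat\nabla g)} = \frac{1}{\sqrt{n}} \sqrt{\frac{\hat\sigma^4}{\hat\mu^4} + \frac{\hat\sigma^2}{2 \hat\mu^2}}. \qquad ■$$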

9.11  The  Parametric  Bootstrap 

For parametric models, standard errors and confidence intervals may also be estimated using the bootstrap. There is only one change. In the nonparametric bootstrap, we sampled $X_1^*, \ldots, X_n^*$ from the empirical distribution $\hat F_n$. In the parametric bootstrap we sample instead from $f(x; \hat\theta_n)$. Here, $\hat\theta_n$ could be the MLE or the method of moments estimator.

9.30 Example. Consider Example 9.29. To get the bootstrap standard error, simulate $X_1^*, \ldots, X_n^* \sim N(\hat\mu, \hat\sigma^2)$, compute $\hat\mu^* = n^{-1} \sum_i X_i^*$ and $\hat\sigma^{2*} = n^{-1} \sum_i (X_i^* - \hat\mu^*)^2$. Then compute $\hat\tau^* = g(\hat\mu^*, \hat\sigma^*) = \hat\sigma^*/\hat\mu^*$. Repeating this $B$ times yields bootstrap replications

$$\hat\tau_1^*, \ldots, \hat\tau_B^*$$

and the estimated standard error is

$$\widehat{\mathrm{se}}_{\mathrm{boot}} = \sqrt{\frac{\sum_{b=1}^B (\hat\tau_b^* - \hat\tau)^2}{B}}. \qquad ■$$

The bootstrap is much easier than the delta method. On the other hand, the delta method has the advantage that it gives a closed form expression for the standard error.
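A minimal sketch of the parametric bootstrap for $\tau = \sigma/\mu$ follows. Everything concrete in it is an assumption made for illustration: the synthetic data set, the number of replications `b=500`, the seeds, and the helper names.

```python
import math
import random


def mle_normal(xs):
    """MLE (mu_hat, sigma_hat) for a Normal sample; sigma_hat uses divisor n."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma


def parametric_bootstrap_se(xs, b: int = 500, seed: int = 1) -> float:
    """Bootstrap standard error of tau_hat = sigma_hat / mu_hat, sampling from N(mu_hat, sigma_hat^2)."""
    rng = random.Random(seed)
    n = len(xs)
    mu_hat, sigma_hat = mle_normal(xs)
    tau_hat = sigma_hat / mu_hat
    replications = []
    for _ in range(b):
        star = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        mu_star, sigma_star = mle_normal(star)
        replications.append(sigma_star / mu_star)
    return math.sqrt(sum((t - tau_hat) ** 2 for t in replications) / b)


# Synthetic data standing in for real observations.
data_rng = random.Random(0)
data = [data_rng.gauss(5.0, 1.0) for _ in range(100)]
se_boot = parametric_bootstrap_se(data)
```

For data like these, `se_boot` should land close to the delta-method standard error computed from the same $\hat\mu$ and $\hat\sigma$, up to Monte Carlo noise.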


9.12  Checking  Assumptions 

If  we  assume  the  data  come  from  a  parametric  model,  then  it  is  a  good  idea  to 
check  that  assumption.  One  possibility  is  to  check  the  assumptions  informally 
by  inspecting  plots  of  the  data.  For  example,  if  a  histogram  of  the  data  looks 
very  bimodal,  then  the  assumption  of  Normality  might  be  questionable.  A 
formal  way  to  test  a  parametric  model  is  to  use  a  goodness-of-fit  test.  See 
Section  10.8. 


9.13  Appendix 

9.13.1  Proofs 

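Proof of Theorem 9.13. Since $\hat\theta_n$ maximizes $M_n(\theta)$, we have $M_n(\hat\theta_n) \ge M_n(\theta_*)$. Hence,

$$\begin{aligned} M(\theta_*) - M(\hat\theta_n) &= M_n(\theta_*) - M(\hat\theta_n) + M(\theta_*) - M_n(\theta_*) \\ &\le M_n(\hat\theta_n) - M(\hat\theta_n) + M(\theta_*) - M_n(\theta_*) \\ &\le \sup_\theta |M_n(\theta) - M(\theta)| + M(\theta_*) - M_n(\theta_*) \\ &\xrightarrow{P} 0 \end{aligned}$$

where the last line follows from (9.7). It follows that, for any $\delta > 0$,

$$P\left( M(\hat\theta_n) < M(\theta_*) - \delta \right) \to 0.$$

Pick any $\epsilon > 0$. By (9.8), there exists $\delta > 0$ such that $|\theta - \theta_*| \ge \epsilon$ implies that $M(\theta) < M(\theta_*) - \delta$. Hence,

$$P(|\hat\theta_n - \theta_*| > \epsilon) \le P\left( M(\hat\theta_n) < M(\theta_*) - \delta \right) \to 0. \qquad ■$$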


Next  we  want  to  prove  Theorem  9.18.  First  we  need  a  lemma. 




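9.31 Lemma. The score function satisfies

$$E_\theta[s(X; \theta)] = 0.$$

Proof. Note that $1 = \int f(x; \theta)\, dx$. Differentiate both sides of this equation to conclude that

$$\begin{aligned} 0 &= \frac{\partial}{\partial\theta} \int f(x; \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x; \theta)\, dx \\ &= \int \frac{\partial f(x; \theta)/\partial\theta}{f(x; \theta)}\, f(x; \theta)\, dx = \int \frac{\partial \log f(x; \theta)}{\partial\theta}\, f(x; \theta)\, dx \\ &= \int s(x; \theta) f(x; \theta)\, dx = E_\theta\, s(X; \theta). \qquad ■ \end{aligned}$$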


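Proof of Theorem 9.18. Let $\ell(\theta) = \log \mathcal{L}(\theta)$. Then,

$$0 = \ell'(\hat\theta) \approx \ell'(\theta) + (\hat\theta - \theta)\, \ell''(\theta).$$

Rearrange the above equation to get $\hat\theta - \theta = -\ell'(\theta)/\ell''(\theta)$ or, in other words,

$$\sqrt{n}\,(\hat\theta - \theta) = \frac{\frac{1}{\sqrt{n}}\, \ell'(\theta)}{-\frac{1}{n}\, \ell''(\theta)} \equiv \frac{\text{TOP}}{\text{BOTTOM}}.$$

Let $Y_i = \partial \log f(X_i; \theta)/\partial\theta$. Recall that $E(Y_i) = 0$ from the previous lemma and also $V(Y_i) = I(\theta)$. Hence,

$$\text{TOP} = n^{-1/2} \sum_i Y_i = \sqrt{n}\, \bar Y = \sqrt{n}\,(\bar Y - 0) \rightsquigarrow W \sim N(0, I(\theta))$$

by the central limit theorem. Let $A_i = -\partial^2 \log f(X_i; \theta)/\partial\theta^2$. Then $E(A_i) = I(\theta)$ and

$$\text{BOTTOM} = \bar A \xrightarrow{P} I(\theta)$$

by the law of large numbers. Apply Theorem 5.5 part (e), to conclude that

$$\sqrt{n}\,(\hat\theta - \theta) \rightsquigarrow \frac{W}{I(\theta)} \overset{d}{=} N\left(0, \frac{1}{I(\theta)}\right).$$

Assuming that $I(\theta)$ is a continuous function of $\theta$, it follows that $I(\hat\theta_n) \xrightarrow{P} I(\theta)$. Now

$$\frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} = \sqrt{n}\, I^{1/2}(\hat\theta_n)(\hat\theta_n - \theta) = \left\{ \sqrt{n}\, I^{1/2}(\theta)(\hat\theta_n - \theta) \right\} \sqrt{\frac{I(\hat\theta_n)}{I(\theta)}}.$$

The first term tends in distribution to $N(0, 1)$. The second term tends in probability to 1. The result follows from Theorem 5.5 part (e). ■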

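Outline of Proof of Theorem 9.24. Write

$$\hat\tau_n = g(\hat\theta_n) \approx g(\theta) + (\hat\theta_n - \theta)\, g'(\theta) = \tau + (\hat\theta_n - \theta)\, g'(\theta).$$

Thus,

$$\sqrt{n}\,(\hat\tau_n - \tau) \approx \sqrt{n}\,(\hat\theta_n - \theta)\, g'(\theta),$$

and hence

$$\frac{\sqrt{n I(\theta)}\,(\hat\tau_n - \tau)}{g'(\theta)} \approx \sqrt{n I(\theta)}\,(\hat\theta_n - \theta).$$

Theorem 9.18 tells us that the right-hand side tends in distribution to a $N(0, 1)$. Hence,

$$\frac{\sqrt{n I(\theta)}\,(\hat\tau_n - \tau)}{g'(\theta)} \rightsquigarrow N(0, 1)$$

or, in other words,

$$\hat\tau_n \approx N\left( \tau, \mathrm{se}^2(\hat\tau_n) \right),$$

where

$$\mathrm{se}^2(\hat\tau_n) = \frac{(g'(\theta))^2}{n I(\theta)}.$$

The result remains true if we substitute $\hat\theta_n$ for $\theta$ by Theorem 5.5 part (e). ■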


9.13.2  Sufficiency 

A statistic is a function $T(X^n)$ of the data. A sufficient statistic is a statistic that contains all the information in the data. To make this more formal, we need some definitions.

9.32 Definition. Write $x^n \leftrightarrow y^n$ if $f(x^n; \theta) = c\, f(y^n; \theta)$ for some constant $c$ that might depend on $x^n$ and $y^n$ but not $\theta$. A statistic $T(x^n)$ is sufficient if $T(x^n) \leftrightarrow T(y^n)$ implies that $x^n \leftrightarrow y^n$.

Notice that if $x^n \leftrightarrow y^n$, then the likelihood function based on $x^n$ has the same shape as the likelihood function based on $y^n$. Roughly speaking, a statistic is sufficient if we can calculate the likelihood function knowing only $T(X^n)$.

9.33 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then $\mathcal{L}(p) = p^S (1 - p)^{n - S}$ where $S = \sum_i X_i$, so $S$ is sufficient. ■
9.34 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma)$ and let $T = (\bar X, S)$. Then

$$f(X^n; \mu, \sigma) = \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^n \exp\left\{ -\frac{n S^2}{2\sigma^2} \right\} \exp\left\{ -\frac{n (\bar X - \mu)^2}{2\sigma^2} \right\}$$

where $S^2$ is the sample variance. The last expression depends on the data only through $T$ and therefore, $T = (\bar X, S)$ is a sufficient statistic. Note that $U = (17 \bar X, S)$ is also a sufficient statistic. If I tell you the value of $U$ then you can easily figure out $T$ and then compute the likelihood. Sufficient statistics are far from unique. Consider the following statistics for the $N(\mu, \sigma^2)$ model:

$$\begin{aligned} T_1(X^n) &= (X_1, \ldots, X_n) \\ T_2(X^n) &= (\bar X, S) \\ T_3(X^n) &= \bar X \\ T_4(X^n) &= (\bar X, S, X_3). \end{aligned}$$

The first statistic is just the whole data set. This is sufficient. The second is also sufficient as we proved above. The third is not sufficient: you can't compute $\mathcal{L}(\mu, \sigma)$ if I only tell you $\bar X$. The fourth statistic $T_4$ is sufficient. The statistics $T_1$ and $T_4$ are sufficient but they contain redundant information. Intuitively, there is a sense in which $T_2$ is a "more concise" sufficient statistic than either $T_1$ or $T_4$. We can express this formally by noting that $T_2$ is a function of $T_1$ and similarly, $T_2$ is a function of $T_4$. For example, $T_2 = g(T_4)$ where $g(a_1, a_2, a_3) = (a_1, a_2)$. ■


9.35 Definition. A statistic $T$ is minimal sufficient if (i) it is sufficient; and (ii) it is a function of every other sufficient statistic.

9.36 Theorem. $T$ is minimal sufficient if the following is true:

$$T(x^n) = T(y^n) \text{ if and only if } x^n \leftrightarrow y^n.$$

A  statistic  induces  a  partition  on  the  set  of  outcomes.  We  can  think  of 
sufficiency  in  terms  of  these  partitions. 

9.37 Example. Let $X_1, X_2 \sim \text{Bernoulli}(\theta)$. Let $V = X_1$, $T = \sum_i X_i$ and $U = (T, X_1)$. Here is the set of outcomes and the statistics:

X1   X2   V    T    U
0    0    0    0    (0,0)
0    1    0    1    (1,0)
1    0    1    1    (1,1)
1    1    1    2    (2,1)

The partitions induced by these statistics are:

V → {(0,0), (0,1)}, {(1,0), (1,1)}

T → {(0,0)}, {(0,1), (1,0)}, {(1,1)}

U → {(0,0)}, {(0,1)}, {(1,0)}, {(1,1)}.

Then $V$ is not sufficient but $T$ and $U$ are sufficient. $T$ is minimal sufficient; $U$ is not minimal since if $x^n = (1, 0)$ and $y^n = (0, 1)$, then $x^n \leftrightarrow y^n$ yet $U(x^n) \ne U(y^n)$. The statistic $W = 17T$ generates the same partition as $T$. It is also minimal sufficient. ■

9.38 Example. For a $N(\mu, \sigma^2)$ model, $T = (\bar X, S)$ is a minimal sufficient statistic. For the Bernoulli model, $T = \sum_i X_i$ is a minimal sufficient statistic. For the Poisson model, $T = \sum_i X_i$ is a minimal sufficient statistic. Check that $T = (\sum_i X_i, X_1)$ is sufficient but not minimal sufficient. Check that $T = X_1$ is not sufficient. ■

I did not give the usual definition of sufficiency. The usual definition is this: $T$ is sufficient if the distribution of $X^n$ given $T(X^n) = t$ does not depend on $\theta$. In other words, $T$ is sufficient if $f(x_1, \ldots, x_n \mid t; \theta) = h(x_1, \ldots, x_n, t)$ where $h$ is some function that does not depend on $\theta$.

9.39 Example. Two coin flips. Let $X = (X_1, X_2) \sim \text{Bernoulli}(p)$. Then $T = X_1 + X_2$ is sufficient. To see this, we need the distribution of $(X_1, X_2)$ given $T = t$. Since $T$ can take 3 possible values, there are 3 conditional distributions to check. They are: (i) the distribution of $(X_1, X_2)$ given $T = 0$:

$P(X_1 = 0, X_2 = 0 \mid t = 0) = 1$, $P(X_1 = 0, X_2 = 1 \mid t = 0) = 0$,
$P(X_1 = 1, X_2 = 0 \mid t = 0) = 0$, $P(X_1 = 1, X_2 = 1 \mid t = 0) = 0$;

(ii) the distribution of $(X_1, X_2)$ given $T = 1$:

$P(X_1 = 0, X_2 = 0 \mid t = 1) = 0$, $P(X_1 = 0, X_2 = 1 \mid t = 1) = \frac{1}{2}$,
$P(X_1 = 1, X_2 = 0 \mid t = 1) = \frac{1}{2}$, $P(X_1 = 1, X_2 = 1 \mid t = 1) = 0$; and

(iii) the distribution of $(X_1, X_2)$ given $T = 2$:

$P(X_1 = 0, X_2 = 0 \mid t = 2) = 0$, $P(X_1 = 0, X_2 = 1 \mid t = 2) = 0$,
$P(X_1 = 1, X_2 = 0 \mid t = 2) = 0$, $P(X_1 = 1, X_2 = 1 \mid t = 2) = 1$.

None of these depend on the parameter $p$. Thus, the distribution of $X_1, X_2 \mid T$ does not depend on $p$, so $T$ is sufficient. ■
9.40 Theorem (Factorization Theorem). $T$ is sufficient if and only if there are functions $g(t, \theta)$ and $h(x)$ such that $f(x^n; \theta) = g(t(x^n), \theta)\, h(x^n)$.

9.41 Example. Return to the two coin flips. Let $t = x_1 + x_2$. Then

$$\begin{aligned} f(x_1, x_2; \theta) &= f(x_1; \theta) f(x_2; \theta) \\ &= \theta^{x_1} (1 - \theta)^{1 - x_1}\, \theta^{x_2} (1 - \theta)^{1 - x_2} \\ &= g(t, \theta)\, h(x_1, x_2) \end{aligned}$$

where $g(t, \theta) = \theta^t (1 - \theta)^{2 - t}$ and $h(x_1, x_2) = 1$. Therefore, $T = X_1 + X_2$ is sufficient. ■

Now we discuss an implication of sufficiency in point estimation. Let $\hat\theta$ be an estimator of $\theta$. The Rao-Blackwell theorem says that an estimator should only depend on the sufficient statistic, otherwise it can be improved. Let $R(\theta, \hat\theta) = E_\theta(\theta - \hat\theta)^2$ denote the MSE of the estimator.

9.42 Theorem (Rao-Blackwell). Let $\hat\theta$ be an estimator and let $T$ be a sufficient statistic. Define a new estimator by

$$\tilde\theta = E(\hat\theta \mid T).$$

Then, for every $\theta$, $R(\theta, \tilde\theta) \le R(\theta, \hat\theta)$.

9.43 Example. Consider flipping a coin twice. Let $\hat\theta = X_1$. This is a well-defined (and unbiased) estimator. But it is not a function of the sufficient statistic $T = X_1 + X_2$. However, note that $\tilde\theta = E(X_1 \mid T) = (X_1 + X_2)/2$. By the Rao-Blackwell Theorem, $\tilde\theta$ has MSE at least as small as $\hat\theta = X_1$. The same applies with $n$ coin flips. Again define $\hat\theta = X_1$ and $T = \sum_i X_i$. Then $\tilde\theta = E(X_1 \mid T) = n^{-1} \sum_i X_i$ has improved MSE. ■
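For two coin flips the improvement can be computed exactly by enumerating the four outcomes. In the sketch below the helper name and the value $p = 0.3$ are arbitrary choices for illustration.

```python
from itertools import product


def exact_mse(estimator, p: float, n: int = 2) -> float:
    """Exact E[(estimator(X) - p)^2] for n IID Bernoulli(p) flips, by enumeration."""
    total = 0.0
    for xs in product((0, 1), repeat=n):
        prob = 1.0
        for x in xs:
            prob *= p if x == 1 else 1.0 - p
        total += prob * (estimator(xs) - p) ** 2
    return total


p = 0.3
mse_first = exact_mse(lambda xs: xs[0], p)             # theta_hat = X_1
mse_mean = exact_mse(lambda xs: sum(xs) / len(xs), p)  # theta_tilde = E(X_1 | T)
```

The first estimator has MSE $p(1 - p) = 0.21$, while the Rao-Blackwellized version has MSE $p(1 - p)/2 = 0.105$.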

9.13.3  Exponential  Families 

Most of the parametric models we have studied so far are special cases of a general class of models called exponential families. We say that $\{f(x; \theta) : \theta \in \Theta\}$ is a one-parameter exponential family if there are functions $\eta(\theta)$, $B(\theta)$, $T(x)$ and $h(x)$ such that

$$f(x; \theta) = h(x)\, e^{\eta(\theta) T(x) - B(\theta)}.$$

It is easy to see that $T(X)$ is sufficient. We call $T$ the natural sufficient statistic.
9.44 Example. Let $X \sim \text{Poisson}(\theta)$. Then

$$f(x; \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!}\, e^{x \log\theta - \theta}$$

and hence, this is an exponential family with $\eta(\theta) = \log\theta$, $B(\theta) = \theta$, $T(x) = x$, $h(x) = 1/x!$. ■


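9.45 Example. Let $X \sim \text{Binomial}(n, \theta)$. Then

$$f(x; \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} = \binom{n}{x} \exp\left\{ x \log\left( \frac{\theta}{1 - \theta} \right) + n \log(1 - \theta) \right\}.$$

In this case,

$$\eta(\theta) = \log\left( \frac{\theta}{1 - \theta} \right), \qquad B(\theta) = -n \log(1 - \theta)$$

and

$$T(x) = x, \qquad h(x) = \binom{n}{x}. \qquad ■$$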
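A quick numerical check that the exponential family form of the Binomial reproduces the usual probability function, with $\eta(\theta) = \log(\theta/(1-\theta))$ and $B(\theta) = -n\log(1-\theta)$. The function names and the values $n = 10$, $\theta = 0.3$ are illustrative assumptions.

```python
import math


def binomial_pmf(x: int, n: int, theta: float) -> float:
    """Standard form: C(n, x) * theta^x * (1 - theta)^(n - x)."""
    return math.comb(n, x) * theta**x * (1.0 - theta) ** (n - x)


def binomial_exp_family(x: int, n: int, theta: float) -> float:
    """Exponential family form: h(x) * exp(eta(theta) * T(x) - B(theta))."""
    eta = math.log(theta / (1.0 - theta))   # natural parameter
    b = -n * math.log(1.0 - theta)          # B(theta)
    return math.comb(n, x) * math.exp(eta * x - b)


n, theta = 10, 0.3
max_gap = max(
    abs(binomial_pmf(x, n, theta) - binomial_exp_family(x, n, theta))
    for x in range(n + 1)
)
```

The two forms should agree to floating-point precision for every $x$ in $0, \ldots, n$.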


We can rewrite an exponential family as

$$f(x; \eta) = h(x)\, e^{\eta T(x) - A(\eta)}$$

where $\eta = \eta(\theta)$ is called the natural parameter and

$$A(\eta) = \log \int h(x)\, e^{\eta T(x)}\, dx.$$

For example a Poisson can be written as $f(x; \eta) = e^{\eta x - e^\eta}/x!$ where the natural parameter is $\eta = \log\theta$.

Let $X_1, \ldots, X_n$ be IID from an exponential family. Then $f(x^n; \theta)$ is an exponential family:

$$f(x^n; \theta) = h_n(x^n)\, e^{\eta(\theta) T_n(x^n) - B_n(\theta)}$$

where $h_n(x^n) = \prod_i h(x_i)$, $T_n(x^n) = \sum_i T(x_i)$ and $B_n(\theta) = n B(\theta)$. This implies that $\sum_i T(X_i)$ is sufficient.

9.46 Example. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Then

$$f(x^n; \theta) = \frac{1}{\theta^n}\, I(x_{(n)} \le \theta)$$

where $I$ is 1 if the term inside the brackets is true and 0 otherwise, and $x_{(n)} = \max\{x_1, \ldots, x_n\}$. Thus $T(X^n) = \max\{X_1, \ldots, X_n\}$ is sufficient. But since $T(X^n) \ne \sum_i T(X_i)$, this cannot be an exponential family. ■
9.47 Theorem. Let $X$ have density in an exponential family. Then,

$$\mathbb{E}(T(X)) = A'(\eta), \qquad \mathbb{V}(T(X)) = A''(\eta).$$
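As a quick sanity check of Theorem 9.47 (a worked instance added for illustration, using the Poisson family of Example 9.44 with $h(x) = 1/x!$, $T(x) = x$ and $\eta = \log\theta$; the integral in $A(\eta)$ becomes a sum because $X$ is discrete), the two derivatives recover the familiar Poisson mean and variance:

```latex
A(\eta) = \log \sum_{x=0}^{\infty} \frac{e^{\eta x}}{x!} = \log \exp\left(e^{\eta}\right) = e^{\eta},
\qquad
A'(\eta) = e^{\eta} = \theta = \mathbb{E}(X),
\qquad
A''(\eta) = e^{\eta} = \theta = \mathbb{V}(X).
```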

If $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, then we say that $f(x;\theta)$ has exponential
family form if

$$f(x;\theta) = h(x) \exp\left\{ \sum_{j=1}^{k} \eta_j(\theta) T_j(x) - B(\theta) \right\}.$$

Again, $T = (T_1, \ldots, T_k)$ is sufficient. An IID sample of size $n$ also has exponential
form with sufficient statistic $\left(\sum_i T_1(X_i), \ldots, \sum_i T_k(X_i)\right)$.

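9.48 Example. Consider the normal family with $\theta = (\mu, \sigma)$. Now,

$$f(x;\theta) = \exp\left\{ \frac{\mu}{\sigma^2}\, x - \frac{x^2}{2\sigma^2} - \frac{1}{2}\left( \frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \right\}.$$

This is exponential with

$$\eta_1(\theta) = \frac{\mu}{\sigma^2}, \qquad T_1(x) = x,$$

$$\eta_2(\theta) = -\frac{1}{2\sigma^2}, \qquad T_2(x) = x^2,$$

$$B(\theta) = \frac{1}{2}\left( \frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2) \right), \qquad h(x) = 1.$$

Hence, with $n$ IID samples, $\left(\sum_i X_i, \sum_i X_i^2\right)$ is sufficient. ■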

As before we can write an exponential family as

$$f(x;\eta) = h(x) \exp\left\{ T^T(x)\,\eta - A(\eta) \right\},$$

where $A(\eta) = \log \int h(x)\, e^{T^T(x)\eta}\, dx$. It can be shown that

$$\mathbb{E}(T(X)) = \dot{A}(\eta), \qquad \mathbb{V}(T(X)) = \ddot{A}(\eta),$$

where the first expression is the vector of partial derivatives and the second
is the matrix of second derivatives.


9.13.4  Computing  Maximum  Likelihood  Estimates 

In some cases we can find the mle $\hat\theta$ analytically. More often, we need to
find the mle by numerical methods. We will briefly discuss two commonly
used methods: (i) Newton-Raphson, and (ii) the EM algorithm. Both are
iterative methods that produce a sequence of values $\theta^0, \theta^1, \ldots$ that, under
ideal conditions, converge to the mle $\hat\theta$. In each case, it is helpful to use a
good starting value $\theta^0$. Often, the method of moments estimator is a good
starting value.

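Newton-Raphson. To motivate Newton-Raphson, let's expand the derivative
of the log-likelihood around $\theta^j$:

$$0 = \ell'(\hat\theta) \approx \ell'(\theta^j) + (\hat\theta - \theta^j)\, \ell''(\theta^j).$$

Solving for $\hat\theta$ gives

$$\hat\theta \approx \theta^j - \frac{\ell'(\theta^j)}{\ell''(\theta^j)}.$$

This suggests the following iterative scheme:

$$\theta^{j+1} = \theta^j - \frac{\ell'(\theta^j)}{\ell''(\theta^j)}.$$

In the multiparameter case, the mle $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ is a vector and the
method becomes

$$\theta^{j+1} = \theta^j - H^{-1} \ell'(\theta^j)$$

where $\ell'(\theta^j)$ is the vector of first derivatives and $H$ is the matrix of second
derivatives of the log-likelihood.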
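The scalar Newton-Raphson scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `newton_raphson`, the tolerance, and the data are made up for the example, and the Poisson model is chosen because its mle is known to be the sample mean, so the answer can be checked.

```python
def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - l'(theta) / l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Illustrative (made-up) Poisson counts; the mle should equal the sample mean.
x = [2, 4, 3, 5, 1, 3, 4, 2]
n, s = len(x), sum(x)

def score(theta):
    # First derivative of the Poisson log-likelihood: sum(x)/theta - n
    return s / theta - n

def hessian(theta):
    # Second derivative of the Poisson log-likelihood: -sum(x)/theta^2
    return -s / theta ** 2

theta_hat = newton_raphson(score, hessian, theta0=1.0)
sample_mean = s / n
```

Starting far from the mle can make the iteration diverge, which is why a good starting value such as the method of moments estimator is helpful.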

The EM Algorithm. The letters EM stand for Expectation-Maximization.
The idea is to iterate between taking an expectation then maximizing. Suppose
we have data $Y$ whose density $f(y;\theta)$ leads to a log-likelihood that is
hard to maximize. But suppose we can find another random variable $Z$ such
that $f(y;\theta) = \int f(y,z;\theta)\, dz$ and such that the likelihood based on $f(y,z;\theta)$
is easy to maximize. In other words, the model of interest is the marginal of a
model with a simpler likelihood. In this case, we call $Y$ the observed data and
$Z$ the hidden (or latent or missing) data. If we could just “fill in” the missing
data, we would have an easy problem. Conceptually, the EM algorithm works
by filling in the missing data, maximizing the log-likelihood, and iterating.


9.49 Example (Mixture of Normals). Sometimes it is reasonable to assume that
the distribution of the data is a mixture of two normals. Think of heights of
people being a mixture of men's and women's heights. Let $\phi(y;\mu,\sigma)$ denote
a normal density with mean $\mu$ and standard deviation $\sigma$. The density of a
mixture of two Normals is

$$f(y;\theta) = (1-p)\,\phi(y;\mu_0,\sigma_0) + p\,\phi(y;\mu_1,\sigma_1).$$

The idea is that an observation is drawn from the first normal with probability
$1-p$ and the second with probability $p$. However, we don't know which Normal
it was drawn from. The parameters are $\theta = (\mu_0, \sigma_0, \mu_1, \sigma_1, p)$. The likelihood
function is

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} \left[ (1-p)\,\phi(y_i;\mu_0,\sigma_0) + p\,\phi(y_i;\mu_1,\sigma_1) \right].$$

Maximizing this function over the five parameters is hard. Imagine that we
were given extra information telling us which of the two normals every observation
came from. These “complete” data are of the form $(Y_1, Z_1), \ldots, (Y_n, Z_n)$,
where $Z_i = 0$ represents the first normal and $Z_i = 1$ represents the second.
Note that $\mathbb{P}(Z_i = 1) = p$. We shall soon see that the likelihood for the complete
data $(Y_1, Z_1), \ldots, (Y_n, Z_n)$ is much simpler than the likelihood for the
observed data $Y_1, \ldots, Y_n$. ■

Now  we  describe  the  EM  algorithm. 


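The EM Algorithm

(0) Pick a starting value $\theta^0$. Now for $j = 1, 2, \ldots,$ repeat steps 1 and 2
below:

(1) (The E-step): Calculate

$$J(\theta \mid \theta^j) = \mathbb{E}_{\theta^j}\left( \log \frac{f(Y^n, Z^n; \theta)}{f(Y^n, Z^n; \theta^j)} \,\Big|\, Y^n = y^n \right).$$

The expectation is over the missing data $Z^n$ treating $\theta^j$ and the observed
data $Y^n$ as fixed.

(2) Find $\theta^{j+1}$ to maximize $J(\theta \mid \theta^j)$.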


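We now show that the EM algorithm always increases the likelihood, that
is, $\mathcal{L}(\theta^{j+1}) \ge \mathcal{L}(\theta^j)$. Note that

$$J(\theta^{j+1} \mid \theta^j) = \mathbb{E}_{\theta^j}\left( \log \frac{f(Y^n, Z^n; \theta^{j+1})}{f(Y^n, Z^n; \theta^j)} \,\Big|\, Y^n = y^n \right) = \log \frac{f(y^n; \theta^{j+1})}{f(y^n; \theta^j)} + \mathbb{E}_{\theta^j}\left( \log \frac{f(Z^n \mid Y^n; \theta^{j+1})}{f(Z^n \mid Y^n; \theta^j)} \,\Big|\, Y^n = y^n \right)$$

and hence

$$\log \frac{\mathcal{L}(\theta^{j+1})}{\mathcal{L}(\theta^j)} = \log \frac{f(y^n; \theta^{j+1})}{f(y^n; \theta^j)} = J(\theta^{j+1} \mid \theta^j) - \mathbb{E}_{\theta^j}\left( \log \frac{f(Z^n \mid Y^n; \theta^{j+1})}{f(Z^n \mid Y^n; \theta^j)} \,\Big|\, Y^n = y^n \right) = J(\theta^{j+1} \mid \theta^j) + K(f_j, f_{j+1})$$

where $f_j = f(z^n \mid y^n; \theta^j)$ and $f_{j+1} = f(z^n \mid y^n; \theta^{j+1})$ are the conditional densities of the
missing data given the observed data, and $K(f,g) = \int f(x) \log(f(x)/g(x))\, dx$
is the Kullback-Leibler distance. Now, $\theta^{j+1}$ was chosen to maximize $J(\theta \mid \theta^j)$.
Hence, $J(\theta^{j+1} \mid \theta^j) \ge J(\theta^j \mid \theta^j) = 0$. Also, by the properties of Kullback-Leibler
divergence, $K(f_j, f_{j+1}) \ge 0$. Hence, $\mathcal{L}(\theta^{j+1}) \ge \mathcal{L}(\theta^j)$ as claimed.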


9.50 Example (Continuation of Example 9.49). Consider again the mixture of
two normals but, for simplicity, assume that $p = 1/2$, $\sigma_0 = \sigma_1 = 1$. The density
is

$$f(y; \mu_0, \mu_1) = \frac{1}{2}\,\phi(y;\mu_0,1) + \frac{1}{2}\,\phi(y;\mu_1,1).$$

Directly maximizing the likelihood is hard. Introduce latent variables $Z_1, \ldots, Z_n$
where $Z_i = 0$ if $Y_i$ is from $\phi(y;\mu_0,1)$, and $Z_i = 1$ if $Y_i$ is from $\phi(y;\mu_1,1)$,
$\mathbb{P}(Z_i = 1) = \mathbb{P}(Z_i = 0) = 1/2$, $f(y_i \mid Z_i = 0) = \phi(y_i;\mu_0,1)$ and $f(y_i \mid Z_i = 1) =
\phi(y_i;\mu_1,1)$. So $f(y) = \sum_{z=0}^{1} f(y,z)$ where we have dropped the parameters
from the density to avoid notational overload. We can write

$$f(z,y) = f(z) f(y \mid z) = \frac{1}{2}\,\phi(y;\mu_0,1)^{1-z}\,\phi(y;\mu_1,1)^{z}.$$
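Hence, the complete likelihood is (up to a constant factor)

$$\prod_{i=1}^{n} \phi(y_i;\mu_0,1)^{1-z_i}\, \phi(y_i;\mu_1,1)^{z_i}.$$

The complete log-likelihood is then (up to an additive constant)

$$\tilde\ell = -\frac{1}{2} \sum_{i=1}^{n} (1-z_i)(y_i - \mu_0)^2 - \frac{1}{2} \sum_{i=1}^{n} z_i (y_i - \mu_1)^2.$$

And so, ignoring terms that do not depend on $\theta$,

$$J(\theta \mid \theta^j) = -\frac{1}{2} \sum_{i=1}^{n} \left(1 - \mathbb{E}(Z_i \mid y^n, \theta^j)\right)(y_i - \mu_0)^2 - \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}(Z_i \mid y^n, \theta^j)(y_i - \mu_1)^2.$$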

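Since $Z_i$ is binary, $\mathbb{E}(Z_i \mid y^n, \theta^j) = \mathbb{P}(Z_i = 1 \mid y^n, \theta^j)$ and, by Bayes' theorem,

$$\mathbb{P}(Z_i = 1 \mid y^n, \theta^j) = \frac{f(y_i \mid Z_i = 1; \theta^j)\,\mathbb{P}(Z_i = 1)}{f(y_i \mid Z_i = 1; \theta^j)\,\mathbb{P}(Z_i = 1) + f(y_i \mid Z_i = 0; \theta^j)\,\mathbb{P}(Z_i = 0)}$$

$$= \frac{\phi(y_i;\mu_1^j,1)\,\frac{1}{2}}{\phi(y_i;\mu_1^j,1)\,\frac{1}{2} + \phi(y_i;\mu_0^j,1)\,\frac{1}{2}} = \frac{\phi(y_i;\mu_1^j,1)}{\phi(y_i;\mu_1^j,1) + \phi(y_i;\mu_0^j,1)} = \tau_i.$$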
Take the derivative of $J(\theta \mid \theta^j)$ with respect to $\mu_0$ and $\mu_1$, set them equal to 0
to get

$$\hat\mu_1^{j+1} = \frac{\sum_{i=1}^{n} \tau_i y_i}{\sum_{i=1}^{n} \tau_i}$$

and

$$\hat\mu_0^{j+1} = \frac{\sum_{i=1}^{n} (1-\tau_i) y_i}{\sum_{i=1}^{n} (1-\tau_i)}.$$

We then recompute $\tau_i$ using $\hat\mu_1^{j+1}$ and $\hat\mu_0^{j+1}$ and iterate. ■
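The updates of Example 9.50 translate almost line by line into code. The following is a minimal sketch, assuming equal mixing weights and unit variances as in the example; the function names, the simulated data, the true means (-2 and 3), the starting values and the fixed number of iterations are all illustrative choices, not part of the original text.

```python
import math
import random

def phi(y, mu):
    """Normal density with mean mu and standard deviation 1."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def em_two_normals(y, mu0, mu1, n_iter=200):
    """EM for the mixture 0.5 * N(mu0, 1) + 0.5 * N(mu1, 1)."""
    for _ in range(n_iter):
        # E-step: tau_i = P(Z_i = 1 | y_i, current means), by Bayes' theorem.
        tau = [phi(v, mu1) / (phi(v, mu1) + phi(v, mu0)) for v in y]
        # M-step: weighted averages of the observations.
        mu1 = sum(t * v for t, v in zip(tau, y)) / sum(tau)
        mu0 = sum((1 - t) * v for t, v in zip(tau, y)) / sum(1 - t for t in tau)
    return mu0, mu1

# Simulated data: half from N(-2, 1), half from N(3, 1) (made-up values).
random.seed(0)
y = [random.gauss(-2.0, 1.0) for _ in range(300)]
y += [random.gauss(3.0, 1.0) for _ in range(300)]

mu0_hat, mu1_hat = em_two_normals(y, mu0=-1.0, mu1=1.0)
```

A practical implementation would stop when the change in the estimates (or in the log-likelihood) falls below a tolerance instead of running a fixed number of iterations.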


9.14  Exercises 

1. Let $X_1, \ldots, X_n \sim \text{Gamma}(\alpha, \beta)$. Find the method of moments estimator
for $\alpha$ and $\beta$.

2. Let $X_1, \ldots, X_n \sim \text{Uniform}(a, b)$ where $a$ and $b$ are unknown parameters
and $a < b$.

(a) Find the method of moments estimators for $a$ and $b$.

(b) Find the mle $\hat{a}$ and $\hat{b}$.

(c) Let $\tau = \int x\, dF(x)$. Find the mle of $\tau$.

(d) Let $\hat\tau$ be the mle of $\tau$. Let $\tilde\tau$ be the nonparametric plug-in estimator
of $\tau = \int x\, dF(x)$. Suppose that $a = 1$, $b = 3$, and $n = 10$. Find the MSE
of $\hat\tau$ by simulation. Find the MSE of $\tilde\tau$ analytically. Compare.

3. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Let $\tau$ be the .95 percentile, i.e. $\mathbb{P}(X < \tau) = .95$.

(a) Find the mle of $\tau$.

(b) Find an expression for an approximate $1 - \alpha$ confidence interval for $\tau$.

(c) Suppose the data are:

 3.23  -2.50   1.88  -0.68   4.43
 0.17   1.03  -0.07  -0.01   0.76
 1.76   3.18   0.33  -0.31   0.30
-0.61   1.52   5.43   1.54   2.28
 0.42   2.33  -1.03   4.00   0.39

Find the mle $\hat\tau$. Find the standard error using the delta method. Find
the standard error using the parametric bootstrap.
4. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Show that the mle is consistent. Hint:
Let $Y = \max\{X_1, \ldots, X_n\}$. For any $c$, $\mathbb{P}(Y < c) = \mathbb{P}(X_1 < c, X_2 <
c, \ldots, X_n < c) = \mathbb{P}(X_1 < c)\,\mathbb{P}(X_2 < c) \cdots \mathbb{P}(X_n < c)$.


5. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. Find the method of moments estimator,
the maximum likelihood estimator and the Fisher information $I(\lambda)$.

6. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Define

$$Y_i = \begin{cases} 1 & \text{if } X_i > 0 \\ 0 & \text{if } X_i \le 0. \end{cases}$$

Let $\psi = \mathbb{P}(Y_1 = 1)$.

(a) Find the maximum likelihood estimator $\hat\psi$ of $\psi$.

(b) Find an approximate 95 percent confidence interval for $\psi$.

(c) Define $\tilde\psi = (1/n) \sum_i Y_i$. Show that $\tilde\psi$ is a consistent estimator of $\psi$.

(d) Compute the asymptotic relative efficiency of $\tilde\psi$ to $\hat\psi$. Hint: Use the
delta method to get the standard error of the mle. Then compute the
standard error (i.e. the standard deviation) of $\tilde\psi$.

(e) Suppose that the data are not really normal. Show that $\hat\psi$ is not
consistent. What, if anything, does $\hat\psi$ converge to?

7. (Comparing two treatments.) $n_1$ people are given treatment 1 and $n_2$
people are given treatment 2. Let $X_1$ be the number of people on treatment
1 who respond favorably to the treatment and let $X_2$ be the
number of people on treatment 2 who respond favorably. Assume that
$X_1 \sim \text{Binomial}(n_1, p_1)$ and $X_2 \sim \text{Binomial}(n_2, p_2)$. Let $\psi = p_1 - p_2$.

(a) Find the MLE $\hat\psi$ for $\psi$.

(b) Find the Fisher information matrix $I(p_1, p_2)$.

(c) Use the multiparameter delta method to find the asymptotic standard
error of $\hat\psi$.

(d) Suppose that $n_1 = n_2 = 200$, $X_1 = 160$ and $X_2 = 148$. Find $\hat\psi$. Find
an approximate 90 percent confidence interval for $\psi$ using (i) the delta
method and (ii) the parametric bootstrap.

8.  Find  the  Fisher  information  matrix  for  Example  9.29. 

9. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$. Let $\theta = e^{\mu}$ and let $\hat\theta = e^{\bar{X}}$ be the mle.
Create a data set (using $\mu = 5$) consisting of $n = 100$ observations.

(a) Use the delta method to get $\widehat{\mathrm{se}}$ and a 95 percent confidence interval
for $\theta$. Use the parametric bootstrap to get $\widehat{\mathrm{se}}$ and a 95 percent confidence
interval for $\theta$. Use the nonparametric bootstrap to get $\widehat{\mathrm{se}}$ and a 95 percent
confidence interval for $\theta$. Compare your answers.

(b) Plot a histogram of the bootstrap replications for the parametric
and nonparametric bootstraps. These are estimates of the distribution
of $\hat\theta$. The delta method also gives an approximation to this distribution,
namely, $\text{Normal}(\hat\theta, \widehat{\mathrm{se}}^2)$. Compare these to the true sampling distribution
of $\hat\theta$ (which you can get by simulation). Which approximation
(parametric bootstrap, bootstrap, or delta method) is closer to the true
distribution?

10. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. The mle is $\hat\theta = X_{(n)} = \max\{X_1, \ldots, X_n\}$.
Generate a dataset of size 50 with $\theta = 1$.

(a) Find the distribution of $\hat\theta$ analytically. Compare the true distribution
of $\hat\theta$ to the histograms from the parametric and nonparametric
bootstraps.

(b) This is a case where the nonparametric bootstrap does very poorly.
Show that for the parametric bootstrap $\mathbb{P}(\hat\theta^* = \hat\theta) = 0$, but for the
nonparametric bootstrap $\mathbb{P}(\hat\theta^* = \hat\theta) \approx .632$. Hint: show that $\mathbb{P}(\hat\theta^* =
\hat\theta) = 1 - (1 - (1/n))^n$, then take the limit as $n$ gets large. What is the
implication of this?


10 

Hypothesis Testing and p-values


Suppose  we  want  to  know  if  exposure  to  asbestos  is  associated  with  lung 
disease.  We  take  some  rats  and  randomly  divide  them  into  two  groups.  We 
expose  one  group  to  asbestos  and  leave  the  second  group  unexposed.  Then 
we  compare  the  disease  rate  in  the  two  groups.  Consider  the  following  two 
hypotheses: 

The  Null  Hypothesis:  The  disease  rate  is  the  same  in  the  two  groups. 

The  Alternative  Hypothesis:  The  disease  rate  is  not  the  same  in  the  two 
groups. 

If  the  exposed  group  has  a  much  higher  rate  of  disease  than  the  unexposed 
group  then  we  will  reject  the  null  hypothesis  and  conclude  that  the  evidence 
favors  the  alternative  hypothesis.  This  is  an  example  of  hypothesis  testing. 

More formally, suppose that we partition the parameter space $\Theta$ into two
disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test

$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1. \tag{10.1}$$

We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.

Let $X$ be a random variable and let $\mathcal{X}$ be the range of $X$. We test a hypothesis
by finding an appropriate subset of outcomes $R \subset \mathcal{X}$ called the rejection
region. If $X \in R$ we reject the null hypothesis, otherwise, we do not reject
the null hypothesis:

$$X \in R \implies \text{reject } H_0, \qquad X \notin R \implies \text{retain (do not reject) } H_0.$$

              Retain Null      Reject Null
$H_0$ true    ✓                type I error
$H_1$ true    type II error    ✓

TABLE 10.1. Summary of outcomes of hypothesis testing.

Usually, the rejection region $R$ is of the form

$$R = \left\{ x : T(x) > c \right\} \tag{10.2}$$

where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis
testing is to find an appropriate test statistic $T$ and an appropriate
critical value $c$.

Warning!  There  is  a  tendency  to  use  hypothesis  testing  methods  even 
when  they  are  not  appropriate.  Often,  estimation  and  confidence  intervals  are 
better  tools.  Use  hypothesis  testing  only  when  you  want  to  test  a  well-defined 
hypothesis. 

Hypothesis testing is like a legal trial. We assume someone is innocent
unless the evidence strongly suggests that he is guilty. Similarly, we retain $H_0$
unless there is strong evidence to reject $H_0$. There are two types of errors we
can make. Rejecting $H_0$ when $H_0$ is true is called a type I error. Retaining
$H_0$ when $H_1$ is true is called a type II error. The possible outcomes for
hypothesis testing are summarized in Table 10.1.


10.1 Definition. The power function of a test with rejection region $R$ is
defined by

$$\beta(\theta) = \mathbb{P}_\theta(X \in R). \tag{10.3}$$

The size of a test is defined to be

$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta). \tag{10.4}$$

A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
A hypothesis of the form $\theta = \theta_0$ is called a simple hypothesis. A hypothesis
of the form $\theta > \theta_0$ or $\theta < \theta_0$ is called a composite hypothesis. A test
of the form

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0$$

is called a two-sided test. A test of the form

$$H_0 : \theta \le \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0$$

or

$$H_0 : \theta \ge \theta_0 \quad \text{versus} \quad H_1 : \theta < \theta_0$$

is called a one-sided test. The most common tests are two-sided.


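10.2 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\sigma$ is known. We want to test
$H_0 : \mu \le 0$ versus $H_1 : \mu > 0$. Hence, $\Theta_0 = (-\infty, 0]$ and $\Theta_1 = (0, \infty)$.
Consider the test:

reject $H_0$ if $T > c$

where $T = \bar{X}$. The rejection region is

$$R = \left\{ (x_1, \ldots, x_n) : T(x_1, \ldots, x_n) > c \right\}.$$

Let $Z$ denote a standard Normal random variable. The power function is

$$\beta(\mu) = \mathbb{P}_\mu\left(\bar{X} > c\right) = \mathbb{P}_\mu\left( \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) = \mathbb{P}\left( Z > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) = 1 - \Phi\left( \frac{\sqrt{n}(c - \mu)}{\sigma} \right).$$

This function is increasing in $\mu$. See Figure 10.1. Hence

$$\text{size} = \sup_{\mu \le 0} \beta(\mu) = \beta(0) = 1 - \Phi\left( \frac{\sqrt{n}\, c}{\sigma} \right).$$

For a size $\alpha$ test, we set this equal to $\alpha$ and solve for $c$ to get

$$c = \frac{\sigma\, \Phi^{-1}(1 - \alpha)}{\sqrt{n}}.$$

We reject when $\bar{X} > \sigma\, \Phi^{-1}(1 - \alpha)/\sqrt{n}$. Equivalently, we reject when

$$\frac{\sqrt{n}\,(\bar{X} - 0)}{\sigma} > z_\alpha$$

where $z_\alpha = \Phi^{-1}(1 - \alpha)$. ■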
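The critical value and power function of Example 10.2 are easy to evaluate numerically. The sketch below uses `statistics.NormalDist` from the Python standard library for $\Phi$ and $\Phi^{-1}$; the function names and the values $\alpha = 0.05$, $\sigma = 1$, $n = 25$ are illustrative assumptions, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal: Z.cdf is Phi, Z.inv_cdf is Phi^{-1}

def critical_value(alpha, sigma, n):
    """c = sigma * Phi^{-1}(1 - alpha) / sqrt(n)."""
    return sigma * Z.inv_cdf(1 - alpha) / sqrt(n)

def power(mu, c, sigma, n):
    """beta(mu) = 1 - Phi(sqrt(n) * (c - mu) / sigma)."""
    return 1 - Z.cdf(sqrt(n) * (c - mu) / sigma)

alpha, sigma, n = 0.05, 1.0, 25   # illustrative values
c = critical_value(alpha, sigma, n)
size = power(0.0, c, sigma, n)            # supremum over mu <= 0 is attained at mu = 0
power_at_half = power(0.5, c, sigma, n)   # power at an alternative mu = 0.5
```

Evaluating `power` on a grid of $\mu$ values traces out an increasing curve that passes through $\alpha$ at $\mu = 0$.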
FIGURE 10.1. The power function for Example 10.2. The size of the test is the
largest probability of rejecting $H_0$ when $H_0$ is true. This occurs at $\mu = 0$ hence the
size is $\beta(0)$. We choose the critical value $c$ so that $\beta(0) = \alpha$.


It would be desirable to find the test with highest power under $H_1$, among
all size $\alpha$ tests. Such a test, if it exists, is called most powerful. Finding
most powerful tests is hard and, in many cases, most powerful tests don't
even exist. Instead of going into detail about when most powerful tests exist,
we'll just consider four widely used tests: the Wald test,$^1$ the $\chi^2$ test, the
permutation test, and the likelihood ratio test.


10.1  The  Wald  Test 

Let $\theta$ be a scalar parameter, let $\hat\theta$ be an estimate of $\theta$ and let $\widehat{\mathrm{se}}$ be the
estimated standard error of $\hat\theta$.

1 The test is named after Abraham Wald (1902-1950), who was a very influential
mathematical statistician. Wald died in a plane crash in India in 1950.


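10.3 Definition. The Wald Test

Consider testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0.$$

Assume that $\hat\theta$ is asymptotically Normal:

$$\frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1).$$

The size $\alpha$ Wald test is: reject $H_0$ when $|W| > z_{\alpha/2}$ where

$$W = \frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}}. \tag{10.5}$$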


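10.4 Theorem. Asymptotically, the Wald test has size $\alpha$, that is,

$$\mathbb{P}_{\theta_0}\left( |W| > z_{\alpha/2} \right) \to \alpha$$

as $n \to \infty$.

Proof. Under $\theta = \theta_0$, $(\hat\theta - \theta_0)/\widehat{\mathrm{se}} \rightsquigarrow N(0, 1)$. Hence, the probability of
rejecting when the null $\theta = \theta_0$ is true is

$$\mathbb{P}_{\theta_0}\left( |W| > z_{\alpha/2} \right) = \mathbb{P}_{\theta_0}\left( \frac{|\hat\theta - \theta_0|}{\widehat{\mathrm{se}}} > z_{\alpha/2} \right) \to \mathbb{P}\left( |Z| > z_{\alpha/2} \right) = \alpha$$

where $Z \sim N(0, 1)$. ■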

10.5 Remark. An alternative version of the Wald test statistic is $W = (\hat\theta -
\theta_0)/\mathrm{se}_0$ where $\mathrm{se}_0$ is the standard error computed at $\theta = \theta_0$. Both versions of
the test are valid.

Let  us  consider  the  power  of  the  Wald  test  when  the  null  hypothesis  is  false. 

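10.6 Theorem. Suppose the true value of $\theta$ is $\theta_* \ne \theta_0$. The power $\beta(\theta_*)$, the
probability of correctly rejecting the null hypothesis, is given (approximately)
by

$$1 - \Phi\left( \frac{\theta_0 - \theta_*}{\widehat{\mathrm{se}}} + z_{\alpha/2} \right) + \Phi\left( \frac{\theta_0 - \theta_*}{\widehat{\mathrm{se}}} - z_{\alpha/2} \right). \tag{10.6}$$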
Recall that $\widehat{\mathrm{se}}$ tends to 0 as the sample size increases. Inspecting (10.6)
closely we note that: (i) the power is large if $\theta_*$ is far from $\theta_0$, and (ii) the
power is large if the sample size is large.
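Both observations about (10.6) can be checked numerically. The following sketch evaluates the approximate power formula for a given standard error; the function name `wald_power` and the particular numbers are assumptions made for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal

def wald_power(theta_star, theta0, se, alpha=0.05):
    """Approximate power of the size-alpha Wald test, following (10.6)."""
    z = Z.inv_cdf(1 - alpha / 2)
    d = (theta0 - theta_star) / se
    return 1 - Z.cdf(d + z) + Z.cdf(d - z)

at_null = wald_power(0.0, 0.0, se=0.5)       # equals alpha when theta_* = theta_0
near = wald_power(1.0, 0.0, se=0.5)          # alternative close to the null
far = wald_power(2.0, 0.0, se=0.5)           # alternative farther away
small_se = wald_power(1.0, 0.0, se=0.25)     # same alternative, larger sample (smaller se)
```

Here `far` exceeds `near`, and `small_se` exceeds `near`, in line with points (i) and (ii).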

10.7 Example (Comparing Two Prediction Algorithms). We test a prediction
algorithm on a test set of size $m$ and we test a second prediction algorithm on
a second test set of size $n$. Let $X$ be the number of incorrect predictions for
algorithm 1 and let $Y$ be the number of incorrect predictions for algorithm
2. Then $X \sim \text{Binomial}(m, p_1)$ and $Y \sim \text{Binomial}(n, p_2)$. To test the null
hypothesis that $p_1 = p_2$ write

$$H_0 : \delta = 0 \quad \text{versus} \quad H_1 : \delta \ne 0$$

where $\delta = p_1 - p_2$. The mle is $\hat\delta = \hat{p}_1 - \hat{p}_2$ with estimated standard error

$$\widehat{\mathrm{se}} = \sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{m} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n} }.$$

The size $\alpha$ Wald test is to reject $H_0$ when $|W| > z_{\alpha/2}$ where

$$W = \frac{\hat\delta - 0}{\widehat{\mathrm{se}}} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{m} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n} }}.$$

The power of this test will be largest when $p_1$ is far from $p_2$ and when the
sample sizes are large.
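The two-sample version of the test can be sketched as follows. The function name and the error counts (30 of 200 and 50 of 200) are hypothetical, chosen only to show the computation.

```python
from math import sqrt
from statistics import NormalDist

def wald_two_proportions(x, m, y, n, alpha=0.05):
    """Wald test of H0: p1 = p2 from two independent binomial counts."""
    p1_hat, p2_hat = x / m, y / n
    se = sqrt(p1_hat * (1 - p1_hat) / m + p2_hat * (1 - p2_hat) / n)
    w = (p1_hat - p2_hat) / se
    reject = abs(w) > NormalDist().inv_cdf(1 - alpha / 2)
    return w, reject

# Hypothetical counts of incorrect predictions on two separate test sets.
w, reject = wald_two_proportions(30, 200, 50, 200)
```

With these made-up counts the statistic is about -2.52, so the size .05 test rejects $H_0$.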

What if we used the same test set to test both algorithms? The two samples
are no longer independent. Instead we use the following strategy. Let $X_i = 1$
if algorithm 1 is correct on test case $i$ and $X_i = 0$ otherwise. Let $Y_i = 1$ if
algorithm 2 is correct on test case $i$ and $Y_i = 0$ otherwise. Define $D_i = X_i - Y_i$.
A typical dataset will look something like this:

Test Case    $X_i$    $Y_i$    $D_i = X_i - Y_i$
1            1        0         1
2            1        1         0
3            1        1         0
4            0        1        -1
5            0        0         0
⋮
n            0        1        -1

Let

$$\delta = \mathbb{E}(D_i) = \mathbb{E}(X_i) - \mathbb{E}(Y_i) = \mathbb{P}(X_i = 1) - \mathbb{P}(Y_i = 1).$$

The nonparametric plug-in estimate of $\delta$ is $\hat\delta = \bar{D} = n^{-1} \sum_{i=1}^{n} D_i$ and $\widehat{\mathrm{se}}(\hat\delta) =
S/\sqrt{n}$, where $S^2 = n^{-1} \sum_{i=1}^{n} (D_i - \bar{D})^2$. To test $H_0 : \delta = 0$ versus $H_1 : \delta \ne 0$
we use $W = \hat\delta/\widehat{\mathrm{se}}$ and reject $H_0$ if $|W| > z_{\alpha/2}$. This is called a paired
comparison. ■
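A paired comparison needs only the per-case differences. The sketch below uses the six rows shown in the table as a toy data set (far too small for the asymptotics to be trusted; it is here only to make the arithmetic concrete), and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def paired_wald(x, y, alpha=0.05):
    """Wald test of H0: delta = 0 using differences D_i = X_i - Y_i."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    d_bar = sum(d) / n
    s2 = sum((v - d_bar) ** 2 for v in d) / n   # S^2 with divisor n, as in the text
    se = sqrt(s2 / n)                           # S / sqrt(n)
    w = d_bar / se
    reject = abs(w) > NormalDist().inv_cdf(1 - alpha / 2)
    return w, reject

# Toy data: correctness indicators of the two algorithms on the same cases.
x = [1, 1, 1, 0, 0, 0]
y = [0, 1, 1, 1, 0, 1]
w, reject = paired_wald(x, y)
```

For this toy data the statistic is about -0.59, so $H_0$ is retained at level .05.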

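10.8 Example (Comparing Two Means). Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be
two independent samples from populations with means $\mu_1$ and $\mu_2$, respectively.
Let's test the null hypothesis that $\mu_1 = \mu_2$. Write this as $H_0 : \delta = 0$
versus $H_1 : \delta \ne 0$ where $\delta = \mu_1 - \mu_2$. Recall that the nonparametric plug-in
estimate of $\delta$ is $\hat\delta = \bar{X} - \bar{Y}$ with estimated standard error

$$\widehat{\mathrm{se}} = \sqrt{ \frac{s_1^2}{m} + \frac{s_2^2}{n} }$$

where $s_1^2$ and $s_2^2$ are the sample variances. The size $\alpha$ Wald test rejects $H_0$
when $|W| > z_{\alpha/2}$ where

$$W = \frac{\hat\delta - 0}{\widehat{\mathrm{se}}} = \frac{\bar{X} - \bar{Y}}{\sqrt{ \frac{s_1^2}{m} + \frac{s_2^2}{n} }}.$$ ■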

10.9 Example (Comparing Two Medians). Consider the previous example again
but let us test whether the medians of the two distributions are the same.
Thus, $H_0 : \delta = 0$ versus $H_1 : \delta \ne 0$ where $\delta = \nu_1 - \nu_2$ and $\nu_1$ and $\nu_2$ are
the medians. The nonparametric plug-in estimate of $\delta$ is $\hat\delta = \hat\nu_1 - \hat\nu_2$ where $\hat\nu_1$
and $\hat\nu_2$ are the sample medians. The estimated standard error $\widehat{\mathrm{se}}$ of $\hat\delta$ can be
obtained from the bootstrap. The Wald test statistic is $W = \hat\delta/\widehat{\mathrm{se}}$. ■

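There is a relationship between the Wald test and the $1 - \alpha$ asymptotic
confidence interval $\hat\theta \pm \widehat{\mathrm{se}}\, z_{\alpha/2}$.

10.10 Theorem. The size $\alpha$ Wald test rejects $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
if and only if $\theta_0 \notin C$ where

$$C = \left( \hat\theta - \widehat{\mathrm{se}}\, z_{\alpha/2},\; \hat\theta + \widehat{\mathrm{se}}\, z_{\alpha/2} \right).$$

Thus, testing the hypothesis is equivalent to checking whether the null value
is in the confidence interval.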

Warning! When we reject $H_0$ we often say that the result is statistically
significant. A result might be statistically significant and yet the size of the
effect might be small. In such a case we have a result that is statistically significant
but not scientifically or practically significant. The difference between
statistical significance and scientific significance is easy to understand in light
of Theorem 10.10. Any confidence interval that excludes $\theta_0$ corresponds to rejecting
$H_0$. But the values in the interval could be close to $\theta_0$ (not scientifically
significant) or far from $\theta_0$ (scientifically significant). See Figure 10.2.

FIGURE 10.2. Scientific significance versus statistical significance. A level $\alpha$ test
rejects $H_0 : \theta = \theta_0$ if and only if the $1 - \alpha$ confidence interval does not include
$\theta_0$. Here are two different confidence intervals. Both exclude $\theta_0$ so in both cases the
test would reject $H_0$. But in the first case, the estimated value of $\theta$ is close to $\theta_0$ so
the finding is probably of little scientific or practical value. In the second case, the
estimated value of $\theta$ is far from $\theta_0$ so the finding is of scientific value. This shows
two things. First, statistical significance does not imply that a finding is of scientific
importance. Second, confidence intervals are often more informative than tests.


10.2 p-values

Reporting “reject $H_0$” or “retain $H_0$” is not very informative. Instead, we
could ask, for every $\alpha$, whether the test rejects at that level. Generally, if the
test rejects at level $\alpha$ it will also reject at level $\alpha' > \alpha$. Hence, there is a
smallest $\alpha$ at which the test rejects and we call this number the p-value. See
Figure 10.3.


10.11 Definition. Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test
with rejection region $R_\alpha$. Then,

$$\text{p-value} = \inf\left\{ \alpha : T(X^n) \in R_\alpha \right\}.$$

That is, the p-value is the smallest level at which we can reject $H_0$.


Informally, the p-value is a measure of the evidence against $H_0$: the smaller
the p-value, the stronger the evidence against $H_0$. Typically, researchers use
the following evidence scale:


(Plot: vertical axis “Reject?” with values Yes and No; horizontal axis $\alpha$ from 0 to 1; the switch from No to Yes occurs at the p-value.)

FIGURE 10.3. p-values explained. For each $\alpha$ we can ask: does our test reject $H_0$
at level $\alpha$? The p-value is the smallest $\alpha$ at which we do reject $H_0$. If the evidence
against $H_0$ is strong, the p-value will be small.


p-value        evidence
< .01          very strong evidence against $H_0$
.01 - .05      strong evidence against $H_0$
.05 - .10      weak evidence against $H_0$
> .1           little or no evidence against $H_0$

Warning! A large p-value is not strong evidence in favor of $H_0$. A large
p-value can occur for two reasons: (i) $H_0$ is true or (ii) $H_0$ is false but the test
has low power.

Warning! Do not confuse the p-value with $\mathbb{P}(H_0 \mid \text{Data})$.$^2$ The p-value is
not the probability that the null hypothesis is true.

The following result explains how to compute the p-value.


2 We discuss quantities like $\mathbb{P}(H_0 \mid \text{Data})$ in the chapter on Bayesian inference.
10.12 Theorem. Suppose that the size $\alpha$ test is of the form

reject $H_0$ if and only if $T(X^n) \ge c_\alpha$.

Then,

$$\text{p-value} = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta\left( T(X^n) \ge T(x^n) \right)$$

where $x^n$ is the observed value of $X^n$. If $\Theta_0 = \{\theta_0\}$ then

$$\text{p-value} = \mathbb{P}_{\theta_0}\left( T(X^n) \ge T(x^n) \right).$$

We can express Theorem 10.12 as follows:

The p-value is the probability (under $H_0$) of observing a value of the
test statistic the same as or more extreme than what was actually
observed.

10.13 Theorem. Let $w = (\hat\theta - \theta_0)/\widehat{\mathrm{se}}$ denote the observed value of the
Wald statistic $W$. The p-value is given by

$$\text{p-value} = \mathbb{P}_{\theta_0}(|W| > |w|) \approx \mathbb{P}(|Z| > |w|) = 2\Phi(-|w|) \tag{10.7}$$

where $Z \sim N(0, 1)$.

To  understand  this  last  theorem,  look  at  Figure  10.4. 

Here  is  an  important  property  of  p-values. 

10.14 Theorem. If the test statistic has a continuous distribution, then under
$H_0 : \theta = \theta_0$, the p-value has a Uniform(0,1) distribution. Therefore, if we
reject $H_0$ when the p-value is less than $\alpha$, the probability of a type I error is
$\alpha$.

In other words, if $H_0$ is true, the p-value is like a random draw from a
Unif(0,1) distribution. If $H_1$ is true, the distribution of the p-value will tend
to concentrate closer to 0.
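A small simulation makes the uniformity claim concrete. The sketch below assumes Normal data with known $\sigma = 1$, so that the statistic $W = \sqrt{n}(\bar{X} - \theta_0)/\sigma$ is exactly standard Normal under $H_0$; the sample size, number of replications and seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal

def wald_p_value(sample, theta0, sigma):
    """Two-sided p-value 2 * Phi(-|w|) for the mean with known sigma."""
    n = len(sample)
    w = sqrt(n) * (sum(sample) / n - theta0) / sigma
    return 2 * Z.cdf(-abs(w))

random.seed(1)
alpha, n, reps = 0.05, 20, 4000

# Generate data under H0 (true mean 0) and record the p-value each time.
p_values = [
    wald_p_value([random.gauss(0.0, 1.0) for _ in range(n)], 0.0, 1.0)
    for _ in range(reps)
]

type_one_rate = sum(p < alpha for p in p_values) / reps   # should be near alpha
below_half = sum(p < 0.5 for p in p_values) / reps        # should be near 0.5
```

Repeating the simulation with a nonzero true mean shifts the p-values toward 0, as the text describes for $H_1$.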

10.15 Example. Recall the cholesterol data from Example 7.15. To test if the
means are different we compute

$$W = \frac{\hat\delta - 0}{\widehat{\mathrm{se}}} = \frac{\bar{X} - \bar{Y}}{\sqrt{ \frac{s_1^2}{m} + \frac{s_2^2}{n} }} = \frac{216.2 - 195.3}{\sqrt{5^2 + 2.4^2}} = 3.78.$$
FIGURE 10.4. The p-value is the smallest $\alpha$ at which you would reject $H_0$. To
find the p-value for the Wald test, we find $\alpha$ such that $|w|$ and $-|w|$ are just at the
boundary of the rejection region. Here, $w$ is the observed value of the Wald statistic:
$w = (\hat\theta - \theta_0)/\widehat{\mathrm{se}}$. This implies that the p-value is the tail area $\mathbb{P}(|Z| > |w|)$ where
$Z \sim N(0, 1)$.


To compute the p-value, let $Z \sim N(0, 1)$ denote a standard Normal random
variable. Then,

$$\text{p-value} = \mathbb{P}(|Z| > 3.78) = 2\,\mathbb{P}(Z < -3.78) = .0002$$

which is very strong evidence against the null hypothesis. To test if the medians
are different, let $\hat\nu_1$ and $\hat\nu_2$ denote the sample medians. Then,

$$W = \frac{\hat\nu_1 - \hat\nu_2}{\widehat{\mathrm{se}}} = \frac{212.5 - 194}{7.7} = 2.4$$

where the standard error 7.7 was found using the bootstrap. The p-value is

$$\text{p-value} = \mathbb{P}(|Z| > 2.4) = 2\,\mathbb{P}(Z < -2.4) = .02$$

which is strong evidence against the null hypothesis. ■
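The arithmetic in Example 10.15 can be reproduced with the standard library. The sketch below plugs in the rounded summary numbers printed in the example (216.2, 195.3, standard errors 5 and 2.4, medians 212.5 and 194, bootstrap standard error 7.7); with these rounded inputs the first statistic comes out near 3.77 rather than exactly 3.78, presumably because the text used unrounded values.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal

def two_sided_p_value(w):
    """p-value = P(|Z| > |w|) = 2 * Phi(-|w|), as in (10.7)."""
    return 2 * Z.cdf(-abs(w))

# Difference of means, using the rounded standard errors from the example.
w_mean = (216.2 - 195.3) / sqrt(5 ** 2 + 2.4 ** 2)
p_mean = two_sided_p_value(w_mean)

# Difference of medians, with the bootstrap standard error 7.7.
w_median = (212.5 - 194) / 7.7
p_median = two_sided_p_value(w_median)
```

Both p-values agree with the reported .0002 and .02 after rounding.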


10.3 The $\chi^2$ Distribution

Before proceeding we need to discuss the $\chi^2$ distribution. Let $Z_1, \ldots, Z_k$ be
independent, standard Normals. Let $V = \sum_{i=1}^{k} Z_i^2$. Then we say that $V$ has
a $\chi^2$ distribution with $k$ degrees of freedom, written $V \sim \chi^2_k$. The probability
density of $V$ is

$$f(v) = \frac{v^{(k/2)-1}\, e^{-v/2}}{2^{k/2}\, \Gamma(k/2)}$$

for $v > 0$. It can be shown that $\mathbb{E}(V) = k$ and $\mathbb{V}(V) = 2k$. We define the upper
$\alpha$ quantile $\chi^2_{k,\alpha} = F^{-1}(1 - \alpha)$ where $F$ is the CDF. That is, $\mathbb{P}(\chi^2_k > \chi^2_{k,\alpha}) = \alpha$.
FIGURE 10.5. The p-value is the smallest $\alpha$ at which we would reject $H_0$. To find
the p-value for the $\chi^2_{k-1}$ test, we find $\alpha$ such that the observed value $t$ of the test
statistic is just at the boundary of the rejection region. This implies that the p-value
is the tail area $\mathbb{P}(\chi^2_{k-1} > t)$.

10.4 Pearson's $\chi^2$ Test For Multinomial Data

Pearson's $\chi^2$ test is used for multinomial data. Recall that if $X = (X_1, \ldots, X_k)$
has a multinomial$(n, p)$ distribution, then the mle of $p$ is $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k) =
(X_1/n, \ldots, X_k/n)$.

Let $p_0 = (p_{01}, \ldots, p_{0k})$ be some fixed vector and suppose we want to test

$$H_0 : p = p_0 \quad \text{versus} \quad H_1 : p \ne p_0.$$


10.16 Definition. Pearson's $\chi^2$ statistic is

$$T = \sum_{j=1}^{k} \frac{(X_j - n p_{0j})^2}{n p_{0j}} = \sum_{j=1}^{k} \frac{(X_j - E_j)^2}{E_j}$$

where $E_j = \mathbb{E}(X_j) = n p_{0j}$ is the expected value of $X_j$ under $H_0$.


10.17 Theorem. Under $H_0$, $T \rightsquigarrow \chi^2_{k-1}$. Hence the test: reject $H_0$ if $T >
\chi^2_{k-1,\alpha}$ has asymptotic level $\alpha$. The p-value is $\mathbb{P}(\chi^2_{k-1} > t)$ where $t$ is the
observed value of the test statistic.

Theorem 10.17 is illustrated in Figure 10.5.

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10.18  Example  (Mendel’s  peas).  Mendel  bred  peas  with  round  yellow  seeds 
and  wrinkled  green  seeds.  There  are  four  types  of  progeny:  round  yellow, 
wrinkled  yellow,  round  green,  and  wrinkled  green.  The  number  of  each  type 
is  multinomial  with  probability  p  =  (pi,T2>T3?T4)-  His  theory  of  inheritance 
predicts  that  p  is  equal  to 


To 


9  3  3  1 

16’  16’  16’  16 


In  n  =  556  trials  he  observed  X  =  (315, 101, 108,  32).  We  will  test  Ho  :  p  =  po 
versus  H\  \  p  ^  po.  Since,  npoi  =  312.75,  npo2  =  ^To3  =  104.25,  and  npo4  = 
34.75,  the  test  statistic  is 


X 


2 


(315  -  312. 75)2  (  (101  -  104.25)2 

312.75  h  104.25 
(108  -  104.25)2  (32  -  34.75)2 

+  104.25  +  34.75 


0.47. 


The  a  =  .05  value  for  a  y2  is  7.815.  Since  0.47  is  not  larger  than  7.815  we  do 
not  reject  the  null.  The  p- value  is 


p-value  =  P(y2  >  .47)  =  .93 


which  is  not  evidence  against  Ho.  Hence,  the  data  do  not  contradict  Mendel’s 
theory.3* 
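The arithmetic in Example 10.18 can be checked with a short Python sketch. It uses only the standard library; the helper names `chi2_sf` and `pearson_chi2` are illustrative and not from the text, and the tail probability is computed from the recurrence for the incomplete gamma function, which is valid for integer degrees of freedom.

```python
import math

def chi2_sf(x, k):
    """Upper tail P(chi2_k > x) for integer k >= 1 (illustrative helper).

    With y = x / 2, uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1),
    starting from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y).
    """
    y = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < k / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def pearson_chi2(counts, p0):
    """Return Pearson's statistic and its asymptotic p-value (k - 1 df)."""
    n = sum(counts)
    expected = [n * p for p in p0]
    stat = sum((x - e) ** 2 / e for x, e in zip(counts, expected))
    return stat, chi2_sf(stat, len(counts) - 1)

counts = [315, 101, 108, 32]                 # Mendel's observed counts
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]        # probabilities under H0
stat, p_value = pearson_chi2(counts, p0)
print(round(stat, 2), round(p_value, 2))
```

The same helper can be used to confirm that 7.815 is (approximately) the upper .05 quantile of a $\chi^2_3$, since `chi2_sf(7.815, 3)` is close to .05.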

In the previous example, one could argue that hypothesis testing is not the
right tool. Hypothesis testing is useful to see if there is evidence to reject $H_0$.
This is appropriate when $H_0$ corresponds to the status quo. It is not useful for
proving that $H_0$ is true. Failure to reject $H_0$ might occur because $H_0$ is true,
but it might occur just because the test has low power. Perhaps a confidence
set for the distance between $p$ and $p_0$ might be more useful in this example.


³ There is some controversy about whether Mendel's results are "too good."


10.5 The Permutation Test

The permutation test is a nonparametric method for testing whether two
distributions are the same. This test is "exact," meaning that it is not based
on large sample theory approximations. Suppose that $X_1, \ldots, X_m \sim F_X$ and
$Y_1, \ldots, Y_n \sim F_Y$ are two independent samples and $H_0$ is the hypothesis that
the two samples are identically distributed. This is the type of hypothesis we
would consider when testing whether a treatment differs from a placebo. More
precisely we are testing

$$H_0 : F_X = F_Y \quad \text{versus} \quad H_1 : F_X \neq F_Y.$$


Let $T(x_1, \ldots, x_m, y_1, \ldots, y_n)$ be some test statistic, for example,

$$T(X_1, \ldots, X_m, Y_1, \ldots, Y_n) = |\overline{X}_m - \overline{Y}_n|.$$

Let $N = m + n$ and consider forming all $N!$ permutations of the data $X_1, \ldots,
X_m, Y_1, \ldots, Y_n$. For each permutation, compute the test statistic $T$. Denote
these values by $T_1, \ldots, T_{N!}$. Under the null hypothesis, each of these values is
equally likely.⁴ The distribution $\mathbb{P}_0$ that puts mass $1/N!$ on each $T_j$ is called
the permutation distribution of $T$. Let $t_{\text{obs}}$ be the observed value of the
test statistic. Assuming we reject when $T$ is large, the p-value is

$$\text{p-value} = \mathbb{P}_0(T > t_{\text{obs}}) = \frac{1}{N!} \sum_{j=1}^{N!} I(T_j > t_{\text{obs}}).$$


10.19 Example. Here is a toy example to make the idea clear. Suppose the
data are: $(X_1, X_2, Y_1) = (1, 9, 3)$. Let $T(X_1, X_2, Y_1) = |\overline{X} - \overline{Y}| = 2$. The
permutations are:

permutation    value of T    probability
(1,9,3)        2             1/6
(9,1,3)        2             1/6
(1,3,9)        7             1/6
(3,1,9)        7             1/6
(3,9,1)        5             1/6
(9,3,1)        5             1/6

The p-value is $\mathbb{P}(T > 2) = 4/6$. ■
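The toy example can be reproduced by brute-force enumeration. The sketch below is only an illustration: the function name `stat` is hypothetical, and exact fractions are used so that the strict inequality $T > t_{\text{obs}}$ is not affected by rounding.

```python
from fractions import Fraction
from itertools import permutations

def stat(perm):
    # The first two entries play the role of (X1, X2); the last is Y1.
    x1, x2, y1 = perm
    return abs(Fraction(x1 + x2, 2) - y1)

data = (1, 9, 3)
t_obs = stat(data)

# Evaluate T on all 3! = 6 orderings; each has probability 1/6 under H0.
values = [stat(p) for p in permutations(data)]
p_value = Fraction(sum(t > t_obs for t in values), len(values))

print(t_obs, sorted(values), p_value)
```

Note that the p-value uses the strict inequality, matching the definition of the permutation p-value given above.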


Usually, it is not practical to evaluate all $N!$ permutations. We can approximate
the p-value by sampling randomly from the set of permutations. The
fraction of times $T_j > t_{\text{obs}}$ among these samples approximates the p-value.


⁴ More precisely, under the null hypothesis, given the ordered data values,
$X_1, \ldots, X_m, Y_1, \ldots, Y_n$ is uniformly distributed over the $N!$ permutations of the data.


Algorithm for Permutation Test

1. Compute the observed value of the test statistic
   $t_{\text{obs}} = T(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$.

2. Randomly permute the data. Compute the statistic again using the
   permuted data.

3. Repeat the previous step $B$ times and let $T_1, \ldots, T_B$ denote the
   resulting values.

4. The approximate p-value is

   $$\frac{1}{B} \sum_{j=1}^{B} I(T_j > t_{\text{obs}}).$$
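The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the default statistic (absolute difference of sample means), the default `B`, and the fixed seed are all assumptions made for the example.

```python
import random

def mean(v):
    return sum(v) / len(v)

def abs_mean_diff(x, y):
    return abs(mean(x) - mean(y))

def permutation_p_value(x, y, statistic=abs_mean_diff, B=10000, seed=0):
    """Approximate permutation p-value, rejecting for large values of T."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    t_obs = statistic(x, y)            # step 1: observed statistic
    pooled = list(x) + list(y)
    m = len(x)
    exceed = 0
    for _ in range(B):                 # steps 2 and 3: permute and recompute
        rng.shuffle(pooled)
        if statistic(pooled[:m], pooled[m:]) > t_obs:
            exceed += 1
    return exceed / B                  # step 4: fraction with T_j > t_obs

# Toy data from Example 10.19; the exact answer there is 4/6.
p = permutation_p_value([1, 9], [3], B=20000)
print(p)
```

With the toy data the Monte Carlo estimate should land close to 4/6; for real problems, `B` controls the simulation error of the estimate.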


10.20 Example. DNA microarrays allow researchers to measure the expression
levels of thousands of genes. The data are the levels of messenger RNA
(mRNA) of each gene, which is thought to provide a measure of how much
protein that gene produces. Roughly, the larger the number, the more active
the gene. The table below, reproduced from Efron et al. (2001), shows the
expression levels for genes from ten patients with two types of liver cancer
cells. There are 2,638 genes in this experiment but here we show just the first
two. The data are log-ratios of the intensity levels of two different color dyes
used on the arrays.

                      Type I                              Type II
Patient      1       2       3      4      5       6      7      8      9     10
Gene 1     230  -1,350  -1,580   -400   -760     970    110    -50   -190   -200
Gene 2     470    -850     -.8   -280    120     390  -1730  -1360     -1   -330

Let's test whether the median level of gene 1 is different between the two
groups. Let $\nu_1$ denote the median level of gene 1 of Type I and let $\nu_2$ denote the
median level of gene 1 of Type II. The absolute difference of sample medians
is $T = |\hat{\nu}_1 - \hat{\nu}_2| = 710$. Now we estimate the permutation distribution by
simulation and we find that the estimated p-value is .045. Thus, if we use a
$\alpha = .05$ level of significance, we would say that there is evidence to reject the
null hypothesis of no difference. ■

In large samples, the permutation test usually gives similar results to a test
that is based on large sample theory. The permutation test is thus most useful
for small samples.
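For the gene 1 data of Example 10.20 the sample is small enough that the permutation distribution can be enumerated exactly rather than simulated. The sketch below assumes the Gene 1 row reads 230, -1350, -1580, -400, -760 for Type I and 970, 110, -50, -190, -200 for Type II (values as read from the table; they reproduce the reported $T = 710$). Because the statistic depends only on which observations fall in each group, it is enough to enumerate the 252 ways of choosing the Type I group instead of all $10!$ orderings.

```python
from itertools import combinations
from statistics import median

# Gene 1 expression levels, as read from the table in Example 10.20.
type1 = [230, -1350, -1580, -400, -760]
type2 = [970, 110, -50, -190, -200]

def abs_median_diff(a, b):
    return abs(median(a) - median(b))

t_obs = abs_median_diff(type1, type2)
pooled = type1 + type2
idx = range(len(pooled))

count = total = 0
for chosen in combinations(idx, len(type1)):
    chosen_set = set(chosen)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in idx if i not in chosen_set]
    total += 1
    count += abs_median_diff(a, b) > t_obs   # strict inequality, as in the text

p_exact = count / total
print(t_obs, total, count, p_exact)
```

Under these assumptions the exact value is 12/252, about .048, which is in line with the simulated estimate of .045 quoted in the example.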

10.6  The  Likelihood  Ratio  Test 

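The Wald test is useful for testing a scalar parameter. The likelihood ratio
test is more general and can be used for testing a vector-valued parameter.


10.21 Definition. Consider testing

$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \notin \Theta_0.$$

The likelihood ratio statistic is

$$\lambda = 2 \log \left( \frac{\sup_{\theta \in \Theta} \mathcal{L}(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta)} \right) = 2 \log \left( \frac{\mathcal{L}(\hat{\theta})}{\mathcal{L}(\hat{\theta}_0)} \right)$$

where $\hat{\theta}$ is the mle and $\hat{\theta}_0$ is the mle when $\theta$ is restricted to lie in $\Theta_0$.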


You might have expected to see the maximum of the likelihood over $\Theta_0^c$
instead of $\Theta$ in the numerator. In practice, replacing $\Theta_0^c$ with $\Theta$ has little
effect on the test statistic. Moreover, the theoretical properties of $\lambda$ are much
simpler if the test statistic is defined this way.

The likelihood ratio test is most useful when $\Theta_0$ consists of all parameter
values $\theta$ such that some coordinates of $\theta$ are fixed at particular values.

10.22 Theorem. Suppose that $\theta = (\theta_1, \ldots, \theta_q, \theta_{q+1}, \ldots, \theta_r)$. Let

$$\Theta_0 = \{\theta : (\theta_{q+1}, \ldots, \theta_r) = (\theta_{0,q+1}, \ldots, \theta_{0,r})\}.$$

Let $\lambda$ be the likelihood ratio test statistic. Under $H_0 : \theta \in \Theta_0$,

$$\lambda(x^n) \rightsquigarrow \chi^2_{r-q}$$

where $r - q$ is the dimension of $\Theta$ minus the dimension of $\Theta_0$. The p-value
for the test is $\mathbb{P}(\chi^2_{r-q} > \lambda)$.

For example, if $\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ and we want to test the null hypothesis
that $\theta_4 = \theta_5 = 0$ then the limiting distribution has $5 - 3 = 2$ degrees of
freedom.


10.23 Example (Mendel's Peas Revisited). Consider example 10.18 again. The
likelihood ratio test statistic for $H_0 : p = p_0$ versus $H_1 : p \neq p_0$ is

$$\lambda = 2 \log \left( \frac{\mathcal{L}(\hat{p})}{\mathcal{L}(p_0)} \right) = 2 \sum_{j=1}^{4} X_j \log \left( \frac{\hat{p}_j}{p_{0j}} \right)$$

$$= 2 \left( 315 \log \frac{315/556}{9/16} + 101 \log \frac{101/556}{3/16} + 108 \log \frac{108/556}{3/16} + 32 \log \frac{32/556}{1/16} \right) = 0.48.$$

Under $H_1$ there are four parameters. However, the parameters must sum to
one so the dimension of the parameter space is three. Under $H_0$ there are no
free parameters so the dimension of the restricted parameter space is zero. The
difference of these two dimensions is three. Therefore, the limiting distribution
of $\lambda$ under $H_0$ is $\chi^2_3$ and the p-value is

$$\text{p-value} = \mathbb{P}(\chi^2_3 > .48) = .92.$$

The conclusion is the same as with the $\chi^2$ test. ■
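The likelihood ratio statistic for Mendel's data can be recomputed in a few lines. The sketch below is illustrative only: `chi2_3_sf` is a hypothetical helper that uses the closed form of the $\chi^2$ upper tail for exactly three degrees of freedom, $\mathrm{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$.

```python
import math

counts = [315, 101, 108, 32]                 # observed counts X_j
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]        # null probabilities p_0j
n = sum(counts)

# lambda = 2 * sum_j X_j * log(p_hat_j / p_0j), with p_hat_j = X_j / n
lam = 2.0 * sum(x * math.log((x / n) / p) for x, p in zip(counts, p0))

def chi2_3_sf(x):
    """P(chi2_3 > x), closed form valid only for 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p_value = chi2_3_sf(lam)
print(round(lam, 2), round(p_value, 2))
```

The statistic and p-value agree with the values .48 and .92 reported in the example, and both are close to the Pearson $\chi^2$ results.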

When the likelihood ratio test and the $\chi^2$ test are both applicable, as in the
last example, they usually lead to similar results as long as the sample size is
large.


10.7  Multiple  Testing 

In  some  situations  we  may  conduct  many  hypothesis  tests.  In  example  10.20, 
there  were  actually  2,638  genes.  If  we  tested  for  a  difference  for  each  gene, 
we  would  be  conducting  2,638  separate  hypothesis  tests.  Suppose  each  test 
is  conducted  at  level  a.  For  any  one  test,  the  chance  of  a  false  rejection  of 
the  null  is  a.  But  the  chance  of  at  least  one  false  rejection  is  much  higher. 
This  is  the  multiple  testing  problem.  The  problem  comes  up  in  many  data 
mining  situations  where  one  may  end  up  testing  thousands  or  even  millions  of 
hypotheses.  There  are  many  ways  to  deal  with  this  problem.  Here  we  discuss 
two methods.

Consider $m$ hypothesis tests:

$$H_{0i} \quad \text{versus} \quad H_{1i}, \qquad i = 1, \ldots, m$$

and let $P_1, \ldots, P_m$ denote the $m$ p-values for these tests.
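To get a feel for how quickly the chance of at least one false rejection grows, the following sketch evaluates $1 - (1 - \alpha)^m$. This formula is an added assumption, not a statement from the text: it applies when all $m$ null hypotheses are true and the tests are independent. The function name is hypothetical.

```python
def prob_at_least_one_false_rejection(alpha, m):
    # Assumes m independent tests, every null true, each tested at level alpha.
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 2638):
    print(m, prob_at_least_one_false_rejection(0.05, m))
```

Under these assumptions, ten tests at level .05 already give roughly a 40 percent chance of at least one false rejection, and with 2,638 tests a false rejection is all but certain.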

The Bonferroni Method

Given p-values $P_1, \ldots, P_m$, reject null hypothesis $H_{0i}$ if

$$P_i < \frac{\alpha}{m}.$$


10.24 Theorem. Using the Bonferroni method, the probability of falsely rejecting
any null hypotheses is less than or equal to $\alpha$.

Proof. Let $R$ be the event that at least one null hypothesis is falsely
rejected. Let $R_i$ be the event that the $i$th null hypothesis is falsely rejected.
Recall that if $A_1, \ldots, A_k$ are events then $\mathbb{P}(\bigcup_{i=1}^{k} A_i) \le \sum_{i=1}^{k} \mathbb{P}(A_i)$. Hence,

$$\mathbb{P}(R) = \mathbb{P}\left( \bigcup_{i=1}^{m} R_i \right) \le \sum_{i=1}^{m} \mathbb{P}(R_i) = \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha$$

from Theorem 10.14. ■


10.25 Example. In the gene example, using $\alpha = .05$, we have that .05/2,638 =
.00001895375. Hence, for any gene with p-value less than .00001895375, we
declare that there is a significant difference. ■


The Bonferroni method is very conservative because it is trying to make
it unlikely that you would make even one false rejection. Sometimes, a more
reasonable idea is to control the false discovery rate (FDR) which is defined
as the mean of the number of false rejections divided by the number of
rejections.

Suppose we reject all null hypotheses whose p-values fall below some threshold.
Let $m_0$ be the number of null hypotheses that are true and let $m_1 =
m - m_0$. The tests can be categorized in a $2 \times 2$ table as in Table 10.2.

Define the False Discovery Proportion (FDP)

$$\text{FDP} = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0. \end{cases}$$

The FDP is the proportion of rejections that are incorrect. Next define FDR =
$\mathbb{E}(\text{FDP})$.


              H0 Not Rejected    H0 Rejected    Total
H0 True       U                  V              m0
H0 False      T                  S              m1
Total         m - R              R              m

TABLE 10.2. Types of outcomes in multiple testing.


The Benjamini-Hochberg (BH) Method

1. Let $P_{(1)} < \cdots < P_{(m)}$ denote the ordered p-values.

2. Define

   $$\ell_i = \frac{i \alpha}{C_m m}, \quad \text{and} \quad R = \max \left\{ i : P_{(i)} < \ell_i \right\}$$

   where $C_m$ is defined to be 1 if the p-values are independent and
   $C_m = \sum_{i=1}^{m} (1/i)$ otherwise.

3. Let $T = P_{(R)}$; we call $T$ the BH rejection threshold.

4. Reject all null hypotheses $H_{0i}$ for which $P_i \le T$.


10.26 Theorem (Benjamini and Hochberg). If the procedure above is applied,
then regardless of how many nulls are true and regardless of the distribution
of the p-values when the null hypothesis is false,

$$\text{FDR} = \mathbb{E}(\text{FDP}) \le \frac{m_0}{m} \alpha \le \alpha.$$

10.27 Example. Figure 10.6 shows six ordered p-values plotted as vertical
lines. If we tested at level $\alpha$ without doing any correction for multiple testing,
we would reject all hypotheses whose p-values are less than $\alpha$. In this case,
the four hypotheses corresponding to the four smallest p-values are rejected.
The Bonferroni method rejects all hypotheses whose p-values are less than
$\alpha/m$. In this case, this leads to no rejections. The BH threshold corresponds
to the last p-value that falls under the line with slope $\alpha$. This leads to two
hypotheses being rejected in this case. ■

FIGURE 10.6. The Benjamini-Hochberg (BH) procedure. For uncorrected testing
we reject when $P_i < \alpha$. For Bonferroni testing we reject when $P_i < \alpha/m$. The BH
procedure rejects when $P_i \le T$. The BH threshold $T$ corresponds to the rightmost
undercrossing of the upward sloping line.

10.28 Example. Suppose that 10 independent hypothesis tests are carried
out, leading to the following ordered p-values:

0.00017  0.00448  0.00671  0.00907  0.01220
0.33626  0.39341  0.53882  0.58125  0.98617

With $\alpha = 0.05$, the Bonferroni test rejects any hypothesis whose p-value is
less than $\alpha/10 = 0.005$. Thus, only the first two hypotheses are rejected. For
the BH test, we find the largest $i$ such that $P_{(i)} < i\alpha/m$, which in this case is
$i = 5$. Thus we reject the first five hypotheses. ■
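Both corrections are easy to express in code. The following Python sketch applies the Bonferroni rule and the BH steps listed earlier to the ten p-values of Example 10.28; the function names and the `independent` flag are illustrative choices, not part of the text.

```python
def bonferroni_reject(pvals, alpha):
    """Reject H0i when P_i < alpha / m."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def bh_reject(pvals, alpha, independent=True):
    """Benjamini-Hochberg: reject H0i when P_i <= T, with T = P_(R)."""
    m = len(pvals)
    c_m = 1.0 if independent else sum(1.0 / i for i in range(1, m + 1))
    ordered = sorted(pvals)
    # R = largest i (1-based) with P_(i) < i * alpha / (C_m * m); 0 if none.
    r = 0
    for i, p in enumerate(ordered, start=1):
        if p < i * alpha / (c_m * m):
            r = i
    if r == 0:
        return [False] * m
    threshold = ordered[r - 1]
    return [p <= threshold for p in pvals]

pvals = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
         0.33626, 0.39341, 0.53882, 0.58125, 0.98617]

n_bonf = sum(bonferroni_reject(pvals, 0.05))
n_bh = sum(bh_reject(pvals, 0.05))
print(n_bonf, n_bh)
```

Run on these p-values, the sketch rejects two hypotheses under Bonferroni and five under BH, as in the example. Passing `independent=False` uses the larger constant $C_m$ and is correspondingly more conservative.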


10.8  Goodness-of-fit  Tests 

There  is  another  situation  where  testing  arises,  namely,  when  we  want  to  check 
whether  the  data  come  from  an  assumed  parametric  model.  There  are  many 
such  tests;  here  is  one. 

Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model. Suppose the data take
values on the real line. Divide the line into $k$ disjoint intervals $I_1, \ldots, I_k$. For
$j = 1, \ldots, k$, let

$$p_j(\theta) = \int_{I_j} f(x; \theta)\, dx$$

be the probability that an observation falls into interval $I_j$ under the assumed
model. Here, $\theta = (\theta_1, \ldots, \theta_s)$ are the parameters in the assumed model. Let
$N_j$ be the number of observations that fall into $I_j$. The likelihood for $\theta$ based
on the counts $N_1, \ldots, N_k$ is the multinomial likelihood

$$Q(\theta) = \prod_{j=1}^{k} p_j(\theta)^{N_j}.$$

Maximizing $Q(\theta)$ yields estimates $\tilde{\theta} = (\tilde{\theta}_1, \ldots, \tilde{\theta}_s)$ of $\theta$. Now define the test
statistic

$$Q = \sum_{j=1}^{k} \frac{(N_j - n p_j(\tilde{\theta}))^2}{n p_j(\tilde{\theta})}. \qquad (10.9)$$

10.29 Theorem. Let $H_0$ be the null hypothesis that the data are IID draws from
the model $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$. Under $H_0$, the statistic $Q$ defined in
equation (10.9) converges in distribution to a $\chi^2_{k-1-s}$ random variable. Thus,
the (approximate) p-value for the test is $\mathbb{P}(\chi^2_{k-1-s} > q)$ where $q$ denotes the
observed value of $Q$.


It is tempting to replace $\tilde{\theta}$ in (10.9) with the mle $\hat{\theta}$. However, this will not
result in a statistic whose limiting distribution is a $\chi^2_{k-1-s}$. It can be shown,
however, due to a theorem of Herman Chernoff and Erich Lehmann from
1954, that the p-value is bounded approximately by the p-values obtained
using a $\chi^2_{k-1-s}$ and a $\chi^2_{k-1}$.

Goodness-of-fit testing has some serious limitations. If we reject $H_0$ then we
conclude we should not use the model. But if we do not reject $H_0$ we cannot
conclude that the model is correct. We may have failed to reject simply
because the test did not have enough power. This is why it is better to use
nonparametric methods whenever possible rather than relying on parametric
assumptions.


10.9  Bibliographic  Remarks 

The  most  complete  book  on  testing  is  Lehmann  (1986).  See  also  Chapter  8  of 
Casella  and  Berger  (2002)  and  Chapter  9  of  Rice  (1995).  The  FDR  method  is 
due  to  Benjamini  and  Hochberg  (1995).  Some  of  the  exercises  are  from  Rice 
(1995).


10.10 Appendix

10.10.1 The Neyman-Pearson Lemma

In the special case of a simple null $H_0 : \theta = \theta_0$ and a simple alternative
$H_1 : \theta = \theta_1$ we can say precisely what the most powerful test is.

10.30 Theorem (Neyman-Pearson). Suppose we test $H_0 : \theta = \theta_0$ versus $H_1 :
\theta = \theta_1$. Let

$$T = \frac{\mathcal{L}(\theta_1)}{\mathcal{L}(\theta_0)} = \frac{\prod_{i=1}^{n} f(x_i; \theta_1)}{\prod_{i=1}^{n} f(x_i; \theta_0)}.$$

Suppose we reject $H_0$ when $T > k$. If we choose $k$ so that $\mathbb{P}_{\theta_0}(T > k) = \alpha$
then this test is the most powerful, size $\alpha$ test. That is, among all tests with
size $\alpha$, this test maximizes the power $\beta(\theta_1)$.


10.10.2 The t-test

To test $H_0 : \mu = \mu_0$ where $\mu = \mathbb{E}(X_i)$ is the mean, we can use the Wald test.
When the data are assumed to be Normal and the sample size is small, it is
common instead to use the t-test. A random variable $T$ has a t-distribution
with $k$ degrees of freedom if it has density

$$f(t) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\, \Gamma\left(\frac{k}{2}\right) \left(1 + \frac{t^2}{k}\right)^{(k+1)/2}}.$$

When the degrees of freedom $k \to \infty$, this tends to a Normal distribution.
When $k = 1$ it reduces to a Cauchy.

Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\theta = (\mu, \sigma^2)$ are both unknown. Suppose
we want to test $\mu = \mu_0$ versus $\mu \neq \mu_0$. Let

$$T = \frac{\sqrt{n}\,(\overline{X}_n - \mu_0)}{S_n}$$

where $S_n^2$ is the sample variance. For large samples $T \approx N(0, 1)$ under $H_0$.
The exact distribution of $T$ under $H_0$ is $t_{n-1}$. Hence if we reject when $|T| >
t_{n-1,\alpha/2}$ then we get a size $\alpha$ test. However, when $n$ is moderately large, the
t-test is essentially identical to the Wald test.
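As a small illustration of the statistic $T = \sqrt{n}(\overline{X}_n - \mu_0)/S_n$, the sketch below computes it for a made-up sample of size five. The data, the null value, and the function name are hypothetical; the critical value 2.776 is the standard table value of $t_{4, .025}$, hard-coded here because the Python standard library has no t quantile function.

```python
import math
from statistics import mean, stdev

def t_statistic(data, mu0):
    """T = sqrt(n) * (sample mean - mu0) / S_n, with S_n the sample std. dev."""
    n = len(data)
    return math.sqrt(n) * (mean(data) - mu0) / stdev(data)

data = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical sample, n = 5
T = t_statistic(data, mu0=2.0)

T_CRIT = 2.776                     # table value of t_{4, 0.025} (alpha = .05)
reject = abs(T) > T_CRIT
print(T, reject)
```

For this sample the statistic is well inside the acceptance region, so the size .05 t-test would not reject $\mu = 2$.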


10.11  Exercises 


1. Prove Theorem 10.6.

2. Prove Theorem 10.14.

3. Prove Theorem 10.10.

4. Prove Theorem 10.12.

5. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and let $Y = \max\{X_1, \ldots, X_n\}$. We want
to test

$$H_0 : \theta = 1/2 \quad \text{versus} \quad H_1 : \theta > 1/2.$$

The Wald test is not appropriate since $Y$ does not converge to a Normal.
Suppose we decide to test this hypothesis by rejecting $H_0$ when $Y > c$.

(a)  Find  the  power  function. 

(b)  What  choice  of  c  will  make  the  size  of  the  test  .05? 

(c) In a sample of size $n = 20$ with $Y = 0.48$ what is the p-value? What
conclusion about $H_0$ would you make?

(d) In a sample of size $n = 20$ with $Y = 0.52$ what is the p-value? What
conclusion about $H_0$ would you make?

6.  There  is  a  theory  that  people  can  postpone  their  death  until  after  an 
important  event.  To  test  the  theory,  Phillips  and  King  (1988)  collected 
data  on  deaths  around  the  Jewish  holiday  Passover.  Of  1919  deaths,  922 
died  the  week  before  the  holiday  and  997  died  the  week  after.  Think  of 
this as a binomial and test the null hypothesis that $\theta = 1/2$. Report and
interpret the p-value. Also construct a confidence interval for $\theta$.

7. In 1861, 10 essays appeared in the New Orleans Daily Crescent. They
were  signed  “Quintus  Curtius  Snodgrass”  and  some  people  suspected 
they  were  actually  written  by  Mark  Twain.  To  investigate  this,  we  will 
consider  the  proportion  of  three  letter  words  found  in  an  author’s  work. 
From  eight  Twain  essays  we  have: 

.225  .262  .217  .240  .230  .229  .235  .217 

From  10  Snodgrass  essays  we  have: 

.209  .205  .196  .210  .202  .207  .224  .223  .220  .201 

(a) Perform a Wald test for equality of the means. Use the nonparametric
plug-in estimator. Report the p-value and a 95 per cent confidence
interval for the difference of means. What do you conclude?

(b) Now use a permutation test to avoid the use of large sample methods.
What is your conclusion? (Brinegar (1963)).

8. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Consider testing

$$H_0 : \theta = 0 \quad \text{versus} \quad H_1 : \theta = 1.$$

Let the rejection region be $R = \{x^n : T(x^n) > c\}$ where $T(x^n) =
n^{-1} \sum_{i=1}^{n} X_i$.

(a) Find $c$ so that the test has size $\alpha$.

(b) Find the power under $H_1$, that is, find $\beta(1)$.

(c) Show that $\beta(1) \to 1$ as $n \to \infty$.

9. Let $\hat{\theta}$ be the mle of a parameter $\theta$ and let $\widehat{\text{se}} = \{n I(\hat{\theta})\}^{-1/2}$ where $I(\theta)$
is the Fisher information. Consider testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad \theta \neq \theta_0.$$

Consider the Wald test with rejection region $R = \{x^n : |Z| > z_{\alpha/2}\}$
where $Z = (\hat{\theta} - \theta_0)/\widehat{\text{se}}$. Let $\theta_1 > \theta_0$ be some alternative. Show that
$\beta(\theta_1) \to 1$.

10. Here are the number of elderly Jewish and Chinese women who died
just before and after the Chinese Harvest Moon Festival.

Week    Chinese    Jewish
 -2       55        141
 -1       33        145
  1       70        139
  2       49        161

Compare the two mortality patterns. (Phillips and Smith (1990)).

11. A randomized, double-blind experiment was conducted to assess the
effectiveness of several drugs for reducing postoperative nausea. The
data are as follows.

                          Number of Patients    Incidence of Nausea
Placebo                           80                     45
Chlorpromazine                    75                     26
Dimenhydrinate                    85                     52
Pentobarbital (100 mg)            67                     35
Pentobarbital (150 mg)            85                     37

(a) Test each drug versus the placebo at the 5 per cent level. Also, report
the estimated odds-ratios. Summarize your findings.

(b) Use the Bonferroni and the FDR method to adjust for multiple
testing. (Beecher (1959)).

12. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$.

(a) Let $\lambda_0 > 0$. Find the size $\alpha$ Wald test for

$$H_0 : \lambda = \lambda_0 \quad \text{versus} \quad H_1 : \lambda \neq \lambda_0.$$

(b) (Computer Experiment.) Let $\lambda_0 = 1$, $n = 20$ and $\alpha = .05$. Simulate
$X_1, \ldots, X_n \sim \text{Poisson}(\lambda_0)$ and perform the Wald test. Repeat many
times and count how often you reject the null. How close is the type I
error rate to .05?

13. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Construct the likelihood ratio test for

$$H_0 : \mu = \mu_0 \quad \text{versus} \quad H_1 : \mu \neq \mu_0.$$

Compare to the Wald test.

14. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Construct the likelihood ratio test for

$$H_0 : \sigma = \sigma_0 \quad \text{versus} \quad H_1 : \sigma \neq \sigma_0.$$

Compare to the Wald test.

15. Let $X \sim \text{Binomial}(n, p)$. Construct the likelihood ratio test for

$$H_0 : p = p_0 \quad \text{versus} \quad H_1 : p \neq p_0.$$

Compare to the Wald test.

16. Let $\theta$ be a scalar parameter and suppose we test

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \neq \theta_0.$$

Let $W$ be the Wald test statistic and let $\lambda$ be the likelihood ratio test
statistic. Show that these tests are equivalent in the sense that

$$\frac{W^2}{\lambda} \xrightarrow{P} 1$$

as $n \to \infty$. Hint: Use a Taylor expansion of the log-likelihood $\ell(\theta)$ to
show that

$$\lambda \approx \left( \sqrt{n}\,(\hat{\theta} - \theta_0) \right)^2 \left( -\frac{1}{n} \ell''(\hat{\theta}) \right).$$


Bayesian Inference


11.1  The  Bayesian  Philosophy 

The statistical methods that we have discussed so far are known as frequentist
(or classical) methods. The frequentist point of view is based on the
following postulates:


F1 Probability refers to limiting relative frequencies. Probabilities are
objective properties of the real world.

F2 Parameters are fixed, unknown constants. Because they are not fluctuating,
no useful probability statements can be made about parameters.

F3 Statistical procedures should be designed to have well-defined long run
frequency properties. For example, a 95 percent confidence interval should
trap the true value of the parameter with limiting frequency at least 95
percent.


There is another approach to inference called Bayesian inference. The
Bayesian approach is based on the following postulates:

B1 Probability describes degree of belief, not limiting frequency. As such,
we can make probability statements about lots of things, not just data
which are subject to random variation. For example, I might say that
"the probability that Albert Einstein drank a cup of tea on August 1,
1948" is .35. This does not refer to any limiting frequency. It reflects my
strength of belief that the proposition is true.

B2 We can make probability statements about parameters, even though
they are fixed constants.

B3 We make inferences about a parameter $\theta$ by producing a probability
distribution for $\theta$. Inferences, such as point estimates and interval
estimates, may then be extracted from this distribution.


Bayesian inference is a controversial approach because it inherently
embraces a subjective notion of probability. In general, Bayesian methods
provide no guarantees on long run performance. The field of statistics puts more
emphasis on frequentist methods although Bayesian methods certainly have
a presence. Certain data mining and machine learning communities seem to
embrace Bayesian methods very strongly. Let's put aside philosophical
arguments for now and see how Bayesian inference is done. We'll conclude this
chapter with some discussion on the strengths and weaknesses of the Bayesian
approach.


11.2  The  Bayesian  Method 

Bayesian  inference  is  usually  carried  out  in  the  following  way. 

1. We choose a probability density $f(\theta)$, called the prior distribution,
that expresses our beliefs about a parameter $\theta$ before we see any
data.

2. We choose a statistical model $f(x|\theta)$ that reflects our beliefs about $x$
given $\theta$. Notice that we now write this as $f(x|\theta)$ instead of $f(x; \theta)$.

3. After observing data $X_1, \ldots, X_n$, we update our beliefs and calculate
the posterior distribution $f(\theta | X_1, \ldots, X_n)$.

To see how the third step is carried out, first suppose that $\theta$ is discrete and
that there is a single, discrete observation $X$. We should use a capital letter
now to denote the parameter since we are treating it like a random variable,
so let $\Theta$ denote the parameter. Now, in this discrete setting,

$$\mathbb{P}(\Theta = \theta | X = x) = \frac{\mathbb{P}(X = x, \Theta = \theta)}{\mathbb{P}(X = x)} = \frac{\mathbb{P}(X = x | \Theta = \theta)\, \mathbb{P}(\Theta = \theta)}{\sum_{\theta} \mathbb{P}(X = x | \Theta = \theta)\, \mathbb{P}(\Theta = \theta)}$$

which you may recognize from Chapter 1 as Bayes' theorem. The version
for continuous variables is obtained by using density functions:

$$f(\theta | x) = \frac{f(x|\theta) f(\theta)}{\int f(x|\theta) f(\theta)\, d\theta}. \qquad (11.1)$$

If we have $n$ IID observations $X_1, \ldots, X_n$, we replace $f(x|\theta)$ with

$$f(x_1, \ldots, x_n | \theta) = \prod_{i=1}^{n} f(x_i | \theta) = \mathcal{L}_n(\theta).$$


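Notation. We will write $X^n$ to mean $(X_1, \ldots, X_n)$ and $x^n$ to mean $(x_1, \ldots, x_n)$.
Now,

$$f(\theta | x^n) = \frac{f(x^n | \theta) f(\theta)}{\int f(x^n | \theta) f(\theta)\, d\theta} = \frac{\mathcal{L}_n(\theta) f(\theta)}{c_n} \propto \mathcal{L}_n(\theta) f(\theta) \qquad (11.2)$$

where

$$c_n = \int \mathcal{L}_n(\theta) f(\theta)\, d\theta \qquad (11.3)$$

is called the normalizing constant. Note that $c_n$ does not depend on $\theta$. We
can summarize by writing:

Posterior is proportional to Likelihood times Prior

or, in symbols,

$$f(\theta | x^n) \propto \mathcal{L}(\theta) f(\theta).$$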

You might wonder, doesn't it cause a problem to throw away the constant
$c_n$? The answer is that we can always recover the constant later if we need to.

What do we do with the posterior distribution? First, we can get a point
estimate by summarizing the center of the posterior. Typically, we use the
mean or mode of the posterior. The posterior mean is

$$\overline{\theta}_n = \int \theta f(\theta | x^n)\, d\theta = \frac{\int \theta \mathcal{L}_n(\theta) f(\theta)\, d\theta}{\int \mathcal{L}_n(\theta) f(\theta)\, d\theta}. \qquad (11.4)$$

We can also obtain a Bayesian interval estimate. We find $a$ and $b$ such that
$\int_{-\infty}^{a} f(\theta | x^n)\, d\theta = \int_{b}^{\infty} f(\theta | x^n)\, d\theta = \alpha/2$. Let $C = (a, b)$. Then

$$\mathbb{P}(\theta \in C | x^n) = \int_{a}^{b} f(\theta | x^n)\, d\theta = 1 - \alpha$$

so $C$ is a $1 - \alpha$ posterior interval.


11.1 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Suppose we take the uniform
distribution $f(p) = 1$ as a prior. By Bayes' theorem, the posterior has the
form

$$f(p | x^n) \propto f(p) \mathcal{L}_n(p) = p^s (1 - p)^{n-s} = p^{s+1-1} (1 - p)^{n-s+1-1}$$

where $s = \sum_{i=1}^{n} x_i$ is the number of successes. Recall that a random variable
has a Beta distribution with parameters $\alpha$ and $\beta$ if its density is

$$f(p; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}.$$

We see that the posterior for $p$ is a Beta distribution with parameters $s + 1$
and $n - s + 1$. That is,

$$f(p | x^n) = \frac{\Gamma(n + 2)}{\Gamma(s + 1) \Gamma(n - s + 1)} p^{(s+1)-1} (1 - p)^{(n-s+1)-1}.$$

We write this as

$$p | x^n \sim \text{Beta}(s + 1, n - s + 1).$$

Notice that we have figured out the normalizing constant without actually
doing the integral $\int \mathcal{L}_n(p) f(p)\, dp$. The mean of a Beta$(\alpha, \beta)$ distribution is
$\alpha/(\alpha + \beta)$ so the Bayes estimator is

$$\overline{p} = \frac{s + 1}{n + 2}. \qquad (11.5)$$

It is instructive to rewrite the estimator as

$$\overline{p} = \lambda_n \hat{p} + (1 - \lambda_n) \tilde{p} \qquad (11.6)$$

where $\hat{p} = s/n$ is the mle, $\tilde{p} = 1/2$ is the prior mean and $\lambda_n = n/(n + 2) \approx 1$.
A 95 percent posterior interval can be obtained by numerically finding $a$ and
$b$ such that $\int_a^b f(p | x^n)\, dp = .95$.

Suppose that instead of a uniform prior, we use the prior $p \sim \text{Beta}(\alpha, \beta)$.
If you repeat the calculations above, you will see that $p | x^n \sim \text{Beta}(\alpha + s, \beta +
n - s)$. The flat prior is just the special case with $\alpha = \beta = 1$. The posterior
mean is

$$\overline{p} = \frac{\alpha + s}{\alpha + \beta + n} = \left( \frac{n}{\alpha + \beta + n} \right) \hat{p} + \left( \frac{\alpha + \beta}{\alpha + \beta + n} \right) p_0$$

where $p_0 = \alpha/(\alpha + \beta)$ is the prior mean. ■
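The Beta posterior of Example 11.1 can be explored numerically. The sketch below is an illustration under stated assumptions: the data ($s = 7$ successes in $n = 10$ trials) are made up, the function names are hypothetical, and the equal-tail 95 percent interval is found by a simple midpoint-rule integration of the posterior density rather than by a library quantile function.

```python
import math

def beta_posterior(s, n, a=1.0, b=1.0):
    """Posterior Beta(a + s, b + n - s) for s successes in n Bernoulli trials."""
    return a + s, b + n - s

def beta_pdf(p, a, b):
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def equal_tail_interval(a, b, level=0.95, grid=20000):
    # Accumulate the CDF with the midpoint rule and read off the two quantiles.
    tail = (1.0 - level) / 2.0
    h = 1.0 / grid
    cdf, lo, hi = 0.0, None, None
    for i in range(grid):
        cdf += beta_pdf((i + 0.5) * h, a, b) * h
        if lo is None and cdf >= tail:
            lo = (i + 1) * h
        if cdf >= 1.0 - tail:
            hi = (i + 1) * h
            break
    return lo, hi

s, n = 7, 10                                   # hypothetical data
a_post, b_post = beta_posterior(s, n)          # uniform prior: Beta(1, 1)
post_mean = a_post / (a_post + b_post)         # (s + 1) / (n + 2)
lam = n / (n + 2)
weighted = lam * (s / n) + (1 - lam) * 0.5     # form (11.6)
lo, hi = equal_tail_interval(a_post, b_post)
print(post_mean, weighted, (lo, hi))
```

The two expressions for the posterior mean agree, which illustrates that the Bayes estimator is a weighted average of the mle and the prior mean.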


In  the  previous  example,  the  prior  was  a  Beta  distribution  and  the  posterior 
was  a  Beta  distribution.  When  the  prior  and  the  posterior  are  in  the  same 
family,  we  say  that  the  prior  is  conjugate  with  respect  to  the  model. 

11.2 Example. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. For simplicity, let us assume that
$\sigma$ is known. Suppose we take as a prior $\theta \sim N(a, b^2)$. In problem 1 in the
exercises it is shown that the posterior for $\theta$ is

$$\theta | X^n \sim N(\overline{\theta}, \tau^2) \qquad (11.7)$$

where

$$\overline{\theta} = w \overline{X} + (1 - w) a, \qquad w = \frac{\frac{1}{\text{se}^2}}{\frac{1}{\text{se}^2} + \frac{1}{b^2}}, \qquad \frac{1}{\tau^2} = \frac{1}{\text{se}^2} + \frac{1}{b^2},$$

and $\text{se} = \sigma/\sqrt{n}$ is the standard error of the mle $\overline{X}$. This is another example
of a conjugate prior. Note that $w \to 1$ and $\tau/\text{se} \to 1$ as $n \to \infty$. So, for large
$n$, the posterior is approximately $N(\hat{\theta}, \text{se}^2)$. The same is true if $n$ is fixed but
$b \to \infty$, which corresponds to letting the prior become very flat.
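The posterior parameters in (11.7) are simple to compute. The following sketch uses made-up inputs ($\overline{X} = 1$, $n = 4$, $\sigma = 1$, prior $N(0, 1)$) and a hypothetical function name; it also forms the interval $\overline{\theta} \pm 1.96\tau$, which is the equal-tail 95 percent interval of a Normal posterior.

```python
import math

def normal_posterior(xbar, n, sigma, a, b):
    """Posterior N(theta_bar, tau^2) for theta with known sigma, prior N(a, b^2)."""
    se2 = sigma ** 2 / n                          # squared standard error of X-bar
    precision = 1.0 / se2 + 1.0 / b ** 2          # 1 / tau^2
    w = (1.0 / se2) / precision
    theta_bar = w * xbar + (1.0 - w) * a
    tau = math.sqrt(1.0 / precision)
    return theta_bar, tau, w

# Hypothetical inputs chosen so the arithmetic is easy to follow.
theta_bar, tau, w = normal_posterior(xbar=1.0, n=4, sigma=1.0, a=0.0, b=1.0)
interval = (theta_bar - 1.96 * tau, theta_bar + 1.96 * tau)
print(w, theta_bar, tau, interval)

# As n grows, the weight on the data approaches 1.
print(normal_posterior(xbar=1.0, n=10 ** 6, sigma=1.0, a=0.0, b=1.0)[2])
```

With these inputs the data receive weight $w = 0.8$; increasing $n$ or $b$ pushes $w$ toward 1, in line with the limiting statements in the example.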

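Continuing with this example, let us find $C = (c, d)$ such that
$\mathbb{P}(\theta \in C | X^n) = .95$. We can do this by choosing $c$ and $d$
such that $\mathbb{P}(\theta < c | X^n) = .025$ and
$\mathbb{P}(\theta > d | X^n) = .025$. So, we want to find $c$ such that

$$\mathbb{P}(\theta < c | X^n) = \mathbb{P}\left(\frac{\theta - \bar{\theta}}{\tau} < \frac{c - \bar{\theta}}{\tau} \,\Big|\, X^n\right) = \mathbb{P}\left(Z < \frac{c - \bar{\theta}}{\tau}\right) = .025.$$

We know that $\mathbb{P}(Z < -1.96) = .025$. So,

$$\frac{c - \bar{\theta}}{\tau} = -1.96$$

implying that $c = \bar{\theta} - 1.96\tau$. By similar arguments,
$d = \bar{\theta} + 1.96\tau$. So a 95 percent Bayesian interval is
$\bar{\theta} \pm 1.96\tau$. Since $\bar{\theta} \approx \hat{\theta}$ and
$\tau \approx se$, the 95 percent Bayesian interval is approximated by
$\hat{\theta} \pm 1.96\, se$ which is the frequentist confidence
interval. ■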
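
The formulas of this example can be evaluated directly. The sketch below assumes the data enter only through the sample mean and that sigma is known; the function names and the sample numbers used afterwards are hypothetical.

```python
import math
from statistics import NormalDist

def normal_posterior(xbar, n, sigma, a, b):
    """Posterior mean and sd for theta under a N(a, b^2) prior, sigma known."""
    se = sigma / math.sqrt(n)                    # standard error of the mle
    precision = 1 / se**2 + 1 / b**2             # equals 1 / tau^2
    w = (1 / se**2) / precision                  # weight on the sample mean
    theta_bar = w * xbar + (1 - w) * a
    tau = math.sqrt(1 / precision)
    return theta_bar, tau, se

def posterior_interval(theta_bar, tau, level=0.95):
    """Equal-tail posterior interval theta_bar +/- z * tau."""
    z = NormalDist().inv_cdf((1 + level) / 2)    # about 1.96 for level 0.95
    return theta_bar - z * tau, theta_bar + z * tau
```

With a fairly flat prior, say `normal_posterior(5.0, 100, 1.0, 0.0, 10.0)`, the returned `tau` is nearly equal to `se`, so the posterior interval nearly coincides with the frequentist one.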
11.3  Functions  of  Parameters


How do we make inferences about a function $\tau = g(\theta)$? Remember in
Chapter 3 we solved the following problem: given the density $f_X$ for $X$,
find the density for $Y = g(X)$. We now simply apply the same reasoning. The
posterior CDF for $\tau$ is

$$H(\tau | x^n) = \mathbb{P}(g(\theta) \le \tau | x^n) = \int_A f(\theta | x^n)\,d\theta$$

where $A = \{\theta : g(\theta) \le \tau\}$. The posterior density is
$h(\tau | x^n) = H'(\tau | x^n)$.


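11.3 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $f(p) = 1$
so that $p | X^n \sim \text{Beta}(s + 1, n - s + 1)$ with
$s = \sum_{i=1}^n x_i$. Let $\psi = \log(p/(1 - p))$. Then

$$H(\psi | x^n) = \mathbb{P}(\Psi \le \psi | x^n) = \mathbb{P}\left(\log\frac{P}{1 - P} \le \psi \,\Big|\, x^n\right) = \mathbb{P}\left(P \le \frac{e^\psi}{1 + e^\psi} \,\Big|\, x^n\right)$$

$$= \int_0^{e^\psi/(1 + e^\psi)} f(p | x^n)\,dp = \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \int_0^{e^\psi/(1 + e^\psi)} p^s (1 - p)^{n - s}\,dp$$

and

$$h(\psi | x^n) = H'(\psi | x^n) = \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \left(\frac{e^\psi}{1 + e^\psi}\right)^{s} \left(\frac{1}{1 + e^\psi}\right)^{n - s} \frac{\partial}{\partial \psi}\left(\frac{e^\psi}{1 + e^\psi}\right)$$

$$= \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \left(\frac{e^\psi}{1 + e^\psi}\right)^{s} \left(\frac{1}{1 + e^\psi}\right)^{n - s} \frac{e^\psi}{(1 + e^\psi)^2}$$

$$= \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \left(\frac{e^\psi}{1 + e^\psi}\right)^{s + 1} \left(\frac{1}{1 + e^\psi}\right)^{n - s + 1}$$

for $\psi \in \mathbb{R}$. ■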


11.4  Simulation 

The posterior can often be approximated by simulation. Suppose we draw
$\theta_1, \ldots, \theta_B \sim p(\theta | x^n)$. Then a histogram of
$\theta_1, \ldots, \theta_B$ approximates the posterior density
$p(\theta | x^n)$. An approximation to the posterior mean
$\bar{\theta}_n = \mathbb{E}(\theta | x^n)$ is
$B^{-1} \sum_{j=1}^B \theta_j$. The posterior $1 - \alpha$ interval can be
approximated by $(\theta_{\alpha/2}, \theta_{1 - \alpha/2})$ where
$\theta_{\alpha/2}$ is the $\alpha/2$ sample quantile of
$\theta_1, \ldots, \theta_B$.

Once we have a sample $\theta_1, \ldots, \theta_B$ from $f(\theta | x^n)$,
let $\tau_i = g(\theta_i)$. Then $\tau_1, \ldots, \tau_B$ is a sample from
$f(\tau | x^n)$. This avoids the need to do any analytical calculations.
Simulation is discussed in more detail in Chapter 24.

11.4 Example. Consider again Example 11.3. We can approximate the posterior
for $\psi$ without doing any calculus. Here are the steps:

1. Draw $P_1, \ldots, P_B \sim \text{Beta}(s + 1, n - s + 1)$.

2. Let $\psi_i = \log(P_i/(1 - P_i))$ for $i = 1, \ldots, B$.

Now $\psi_1, \ldots, \psi_B$ are IID draws from $h(\psi | x^n)$. A histogram
of these values provides an estimate of $h(\psi | x^n)$. ■
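
The two steps of this example can be sketched in a few lines. This is only an illustration: the function names, the number of draws, and the use of sample quantiles for an interval are assumptions added here.

```python
import math
import random

def simulate_log_odds(s, n, B=10000, seed=0):
    """Draws from the posterior of psi = log(p / (1 - p)) under a flat prior."""
    rng = random.Random(seed)
    draws = []
    for _ in range(B):
        p = rng.betavariate(s + 1, n - s + 1)     # step 1: draw P_i
        draws.append(math.log(p / (1 - p)))       # step 2: transform to psi_i
    return draws

def summarize(draws, alpha=0.05):
    """Posterior mean and equal-tail 1 - alpha interval from the draws."""
    ordered = sorted(draws)
    B = len(ordered)
    mean = sum(ordered) / B
    lo = ordered[int(alpha / 2 * B)]
    hi = ordered[int((1 - alpha / 2) * B) - 1]
    return mean, lo, hi
```

A histogram of the list returned by `simulate_log_odds` plays the role of the estimate of the posterior density of the log-odds.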

11.5  Large  Sample  Properties  of  Bayes’  Procedures 

In  the  Bernoulli  and  Normal  examples  we  saw  that  the  posterior  mean  was 
close  to  the  mle.  This  is  true  in  greater  generality. 

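11.5 Theorem. Let $\hat{\theta}_n$ be the mle and let
$\hat{se} = 1/\sqrt{n I(\hat{\theta}_n)}$. Under appropriate regularity
conditions, the posterior is approximately Normal with mean $\hat{\theta}_n$
and standard deviation $\hat{se}$. Hence,
$\bar{\theta}_n \approx \hat{\theta}_n$. Also, if
$C_n = (\hat{\theta}_n - z_{\alpha/2}\,\hat{se},\ \hat{\theta}_n + z_{\alpha/2}\,\hat{se})$
is the asymptotic frequentist $1 - \alpha$ confidence interval, then $C_n$ is
also an approximate $1 - \alpha$ Bayesian posterior interval:

$$\mathbb{P}(\theta \in C_n | X^n) \to 1 - \alpha.$$

There is also a Bayesian delta method. Let $\tau = g(\theta)$. Then

$$\tau | X^n \approx N(\hat{\tau}, \tilde{se}^2)$$

where $\hat{\tau} = g(\hat{\theta})$ and
$\tilde{se} = \hat{se}\,|g'(\hat{\theta})|$.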
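
The theorem can be illustrated in the Bernoulli model, where the Fisher information is 1/(p(1 - p)). The sketch below, with hypothetical function names and a flat prior chosen for convenience, measures by simulation how much posterior mass the frequentist interval receives.

```python
import math
import random
from statistics import NormalDist

def wald_interval(s, n, level=0.95):
    """Asymptotic frequentist interval p_hat +/- z * se for Bernoulli data."""
    p_hat = s / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)     # 1 / sqrt(n * I(p_hat))
    z = NormalDist().inv_cdf((1 + level) / 2)
    return p_hat - z * se, p_hat + z * se

def posterior_mass(s, n, interval, B=20000, seed=0):
    """Posterior probability of `interval` under a flat prior, by simulation."""
    rng = random.Random(seed)
    lo, hi = interval
    hits = sum(lo <= rng.betavariate(s + 1, n - s + 1) <= hi for _ in range(B))
    return hits / B
```

For a large sample such as s = 300 out of n = 1000, `posterior_mass(300, 1000, wald_interval(300, 1000))` should be close to 0.95, up to Monte Carlo error.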

11.6  Flat  Priors,  Improper  Priors,  and 
“Noninformative”  Priors 

An important question in Bayesian inference is: where does one get the prior
$f(\theta)$? One school of thought, called subjectivism, says that the prior
should reflect our subjective opinion about $\theta$ (before the data are
collected). This may be possible in some cases but is impractical in
complicated problems, especially if there are many parameters. Moreover,
injecting subjective opinion into the analysis is contrary to the goal of
making scientific inference as objective as possible. An alternative is to
try to define some sort of “noninformative prior.” An obvious candidate for a
noninformative prior is to use a flat prior $f(\theta) \propto$ constant.

In the Bernoulli example, taking $f(p) = 1$ leads to
$p | X^n \sim \text{Beta}(s + 1, n - s + 1)$ as we saw earlier, which seemed
very reasonable. But unfettered use of flat priors raises some questions.

Improper Priors. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ with
$\sigma$ known. Suppose we adopt a flat prior $f(\theta) \propto c$ where
$c > 0$ is a constant. Note that $\int f(\theta)\,d\theta = \infty$ so this
is not a probability density in the usual sense. We call such a prior an
improper prior. Nonetheless, we can still formally carry out Bayes’ theorem
and compute the posterior density by multiplying the prior and the
likelihood:
$f(\theta | x^n) \propto \mathcal{L}_n(\theta) f(\theta) \propto \mathcal{L}_n(\theta)$.
This gives $\theta | X^n \sim N(\bar{X}, \sigma^2/n)$ and the resulting point
and interval estimators agree exactly with their frequentist counterparts.
In general, improper priors are not a problem as long as the resulting
posterior is a well-defined probability distribution.

Flat Priors are Not Invariant. Let $X \sim \text{Bernoulli}(p)$ and suppose
we use the flat prior $f(p) = 1$. This flat prior presumably represents our
lack of information about $p$ before the experiment. Now let
$\psi = \log(p/(1 - p))$. This is a transformation of $p$ and we can compute
the resulting distribution for $\psi$, namely,

$$f_\Psi(\psi) = \frac{e^\psi}{(1 + e^\psi)^2}$$

which is not flat. But if we are ignorant about $p$ then we are also ignorant
about $\psi$ so we should use a flat prior for $\psi$. This is a
contradiction. In short, the notion of a flat prior is not well defined
because a flat prior on a parameter does not imply a flat prior on a
transformed version of the parameter. Flat priors are not transformation
invariant.
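
A quick simulation makes the lack of invariance visible. In this sketch (function names are illustrative assumptions), p is drawn from the flat prior, transformed to the log-odds scale, and the empirical distribution is compared with the logistic form given above.

```python
import math
import random

def logistic_density(psi):
    """Density of psi = log(p / (1 - p)) when p ~ Uniform(0, 1)."""
    return math.exp(psi) / (1 + math.exp(psi)) ** 2

def logistic_cdf(psi):
    """Corresponding CDF: P(Psi <= psi) = e^psi / (1 + e^psi)."""
    return math.exp(psi) / (1 + math.exp(psi))

def empirical_cdf(point, B=50000, seed=0):
    """Fraction of simulated log-odds values at or below `point`."""
    rng = random.Random(seed)
    below = 0
    for _ in range(B):
        p = rng.random()
        while p == 0.0:              # random() lies in [0, 1); avoid log(0)
            p = rng.random()
        psi = math.log(p / (1 - p))
        below += psi <= point
    return below / B
```

The density is largest at 0 and decays in both directions, so equal-width ranges of the log-odds receive unequal prior mass even though p itself is uniform.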

Jeffreys’ Prior. Jeffreys came up with a rule for creating priors. The rule
is: take

$$f(\theta) \propto I(\theta)^{1/2}$$

where $I(\theta)$ is the Fisher information function. This rule turns out to
be transformation invariant. There are various reasons for thinking that this
prior might be a useful prior but we will not go into details here.


11.6 Example. Consider the Bernoulli($p$) model. Recall that

$$I(p) = \frac{1}{p(1 - p)}.$$

Jeffreys’ rule says to use the prior

$$f(p) \propto \sqrt{I(p)} = p^{-1/2}(1 - p)^{-1/2}.$$

This is a Beta(1/2, 1/2) density. This is very close to a uniform density. ■

In a multiparameter problem, the Jeffreys’ prior is defined to be
$f(\theta) \propto \sqrt{|I(\theta)|}$ where $|A|$ denotes the determinant of
a matrix $A$ and $I(\theta)$ is the Fisher information matrix.


11.7  Multiparameter  Problems 


Suppose that $\theta = (\theta_1, \ldots, \theta_p)$. The posterior density
is still given by

$$f(\theta | x^n) \propto \mathcal{L}_n(\theta) f(\theta). \tag{11.8}$$

The question now arises of how to extract inferences about one parameter.
The key is to find the marginal posterior density for the parameter of
interest. Suppose we want to make inferences about $\theta_1$. The marginal
posterior for $\theta_1$ is

$$f(\theta_1 | x^n) = \int \cdots \int f(\theta_1, \ldots, \theta_p | x^n)\,d\theta_2 \cdots d\theta_p. \tag{11.9}$$

In practice, it might not be feasible to do this integral. Simulation can
help. Draw randomly from the posterior:

$$\theta^1, \ldots, \theta^B \sim f(\theta | x^n)$$

where the superscripts index the different draws. Each $\theta^j$ is a vector
$\theta^j = (\theta_1^j, \ldots, \theta_p^j)$. Now collect together the first
component of each draw:

$$\theta_1^1, \ldots, \theta_1^B.$$

These are a sample from $f(\theta_1 | x^n)$ and we have avoided doing any
integrals.

11.7 Example (Comparing Two Binomials). Suppose we have $n_1$ control
patients and $n_2$ treatment patients and that $X_1$ control patients survive
while $X_2$ treatment patients survive. We want to estimate
$\tau = g(p_1, p_2) = p_2 - p_1$. Then,

$$X_1 \sim \text{Binomial}(n_1, p_1) \quad \text{and} \quad X_2 \sim \text{Binomial}(n_2, p_2).$$

If $f(p_1, p_2) = 1$, the posterior is

$$f(p_1, p_2 | x_1, x_2) \propto p_1^{x_1}(1 - p_1)^{n_1 - x_1}\, p_2^{x_2}(1 - p_2)^{n_2 - x_2}.$$

Notice that $(p_1, p_2)$ live on a rectangle (a square, actually) and that

$$f(p_1, p_2 | x_1, x_2) = f(p_1 | x_1) f(p_2 | x_2)$$

where

$$f(p_1 | x_1) \propto p_1^{x_1}(1 - p_1)^{n_1 - x_1} \quad \text{and} \quad f(p_2 | x_2) \propto p_2^{x_2}(1 - p_2)^{n_2 - x_2}$$

which implies that $p_1$ and $p_2$ are independent under the posterior. Also,
$p_1 | x_1 \sim \text{Beta}(x_1 + 1, n_1 - x_1 + 1)$ and
$p_2 | x_2 \sim \text{Beta}(x_2 + 1, n_2 - x_2 + 1)$. If we simulate
$P_{1,1}, \ldots, P_{1,B} \sim \text{Beta}(x_1 + 1, n_1 - x_1 + 1)$ and
$P_{2,1}, \ldots, P_{2,B} \sim \text{Beta}(x_2 + 1, n_2 - x_2 + 1)$, then
$\tau_b = P_{2,b} - P_{1,b}$, $b = 1, \ldots, B$, is a sample from
$f(\tau | x_1, x_2)$. ■
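
The simulation described in this example might look as follows. It is a sketch under the flat prior used above; the function names are illustrative, and any counts passed in (such as 12 of 40 controls and 20 of 40 treated) are made-up numbers, not data from the text.

```python
import random

def simulate_difference(x1, n1, x2, n2, B=10000, seed=0):
    """Posterior draws of tau = p2 - p1 under independent flat priors."""
    rng = random.Random(seed)
    taus = []
    for _ in range(B):
        p1 = rng.betavariate(x1 + 1, n1 - x1 + 1)   # control group
        p2 = rng.betavariate(x2 + 1, n2 - x2 + 1)   # treatment group
        taus.append(p2 - p1)
    return taus

def posterior_summary(taus, alpha=0.05):
    """Posterior mean, equal-tail interval, and P(tau > 0) from the draws."""
    ordered = sorted(taus)
    B = len(ordered)
    mean = sum(ordered) / B
    lo = ordered[int(alpha / 2 * B)]
    hi = ordered[int((1 - alpha / 2) * B) - 1]
    prob_positive = sum(t > 0 for t in ordered) / B
    return mean, (lo, hi), prob_positive
```

Because the two posteriors are independent, each pair of draws can be generated separately and differenced, exactly as in the last sentence of the example.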


11.8  Bayesian  Testing 


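Hypothesis testing from a Bayesian point of view is a complex topic. We will
only give a brief sketch of the main idea here. The Bayesian approach to
testing involves putting a prior on $H_0$ and on the parameter $\theta$ and
then computing $\mathbb{P}(H_0 | X^n)$. Consider the case where $\theta$ is
scalar and we are testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0.$$

It is usually reasonable to use the prior
$\mathbb{P}(H_0) = \mathbb{P}(H_1) = 1/2$ (although this is not essential in
what follows). Under $H_1$ we need a prior for $\theta$. Denote this prior
density by $f(\theta)$. From Bayes’ theorem

$$\mathbb{P}(H_0 | X^n = x^n) = \frac{f(x^n | H_0)\mathbb{P}(H_0)}{f(x^n | H_0)\mathbb{P}(H_0) + f(x^n | H_1)\mathbb{P}(H_1)}$$

$$= \frac{\frac{1}{2} f(x^n | \theta_0)}{\frac{1}{2} f(x^n | \theta_0) + \frac{1}{2} f(x^n | H_1)}$$

$$= \frac{f(x^n | \theta_0)}{f(x^n | \theta_0) + \int f(x^n | \theta) f(\theta)\,d\theta}$$

$$= \frac{\mathcal{L}(\theta_0)}{\mathcal{L}(\theta_0) + \int \mathcal{L}(\theta) f(\theta)\,d\theta}.$$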


We saw that, in estimation problems, the prior was not very influential and
that the frequentist and Bayesian methods gave similar answers. This is not
the case in hypothesis testing. Also, one can’t use improper priors in
testing because this leads to an undefined constant in the denominator of the
expression above. Thus, if you use Bayesian testing you must choose the prior
$f(\theta)$ very carefully. It is possible to get a prior-free bound on
$\mathbb{P}(H_0 | X^n = x^n)$. Notice that
$0 \le \int \mathcal{L}(\theta) f(\theta)\,d\theta \le \mathcal{L}(\hat{\theta})$.
Hence,

$$\frac{\mathcal{L}(\theta_0)}{\mathcal{L}(\theta_0) + \mathcal{L}(\hat{\theta})} \le \mathbb{P}(H_0 | X^n = x^n) \le 1.$$

The upper bound is not very interesting, but the lower bound is non-trivial.
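
As a concrete case, the posterior probability of the null and its prior-free lower bound can be computed in closed form for Bernoulli data. The sketch below assumes, purely for illustration, a test of p = 1/2 with a Uniform(0, 1) prior on p under the alternative; the function names are hypothetical.

```python
import math

def likelihood(p, s, n):
    """Bernoulli likelihood p^s (1 - p)^(n - s) for s successes in n trials."""
    return p**s * (1 - p) ** (n - s)

def posterior_prob_h0(s, n, p0=0.5):
    """P(H0 | data) with P(H0) = 1/2 and a Uniform(0, 1) prior under H1."""
    # The integral of p^s (1-p)^(n-s) over (0, 1) is the Beta function
    # B(s + 1, n - s + 1) = 1 / ((n + 1) * C(n, s)).
    marginal_h1 = 1 / ((n + 1) * math.comb(n, s))
    like_h0 = likelihood(p0, s, n)
    return like_h0 / (like_h0 + marginal_h1)

def lower_bound(s, n, p0=0.5):
    """Prior-free lower bound L(p0) / (L(p0) + L(p_hat))."""
    like_h0 = likelihood(p0, s, n)
    like_mle = likelihood(s / n, s, n)
    return like_h0 / (like_h0 + like_mle)
```

With 5 successes in 10 trials the mle equals the null value, so the lower bound is exactly 1/2, while the posterior probability of the null under the uniform prior is larger.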


11.9  Strengths  and  Weaknesses  of  Bayesian  Inference 


Bayesian inference is appealing when prior information is available since
Bayes’ theorem is a natural way to combine prior information with data. Some
people find Bayesian inference psychologically appealing because it allows us
to make probability statements about parameters. In contrast, frequentist
inference provides confidence sets $C_n$ which trap the parameter 95 percent
of the time, but we cannot say that $\mathbb{P}(\theta \in C_n | X^n)$ is
.95. In the frequentist approach we can make probability statements about
$C_n$, not $\theta$. However, psychological appeal is not a compelling
scientific argument for using one type of inference over another.

In  parametric  models,  with  large  samples,  Bayesian  and  frequentist  methods 
give  approximately  the  same  inferences.  In  general,  they  need  not  agree. 

Here are three examples that illustrate the strengths and weaknesses of
Bayesian inference. The first example is Example 6.14 revisited. This example
shows the psychological appeal of Bayesian inference. The second and third
show that Bayesian methods can fail.

11.8 Example (Example 6.14 revisited). We begin by reviewing the example.
Let $\theta$ be a fixed, unknown real number and let $X_1, X_2$ be
independent random variables such that
$\mathbb{P}(X_i = 1) = \mathbb{P}(X_i = -1) = 1/2$. Now define
$Y_i = \theta + X_i$ and suppose that you only observe $Y_1$ and $Y_2$. Let

$$C = \begin{cases} \{Y_1 - 1\} & \text{if } Y_1 = Y_2 \\ \{(Y_1 + Y_2)/2\} & \text{if } Y_1 \ne Y_2. \end{cases}$$

This is a 75 percent confidence set since, no matter what $\theta$ is,
$\mathbb{P}_\theta(\theta \in C) = 3/4$.

Suppose we observe $Y_1 = 15$ and $Y_2 = 17$. Then our 75 percent confidence
interval is $\{16\}$. However, we are certain, in this case, that
$\theta = 16$. So calling this a 75 percent confidence set bothers many
people. Nonetheless, $C$ is a valid 75 percent confidence set. It will trap
the true value 75 percent of the time.

The Bayesian solution is more satisfying to many. For simplicity, assume that
$\theta$ is an integer. Let $f(\theta)$ be a prior mass function such that
$f(\theta) > 0$ for every integer $\theta$. When
$Y = (Y_1, Y_2) = (15, 17)$, the likelihood function is

$$\mathcal{L}(\theta) = \begin{cases} 1/4 & \theta = 16 \\ 0 & \text{otherwise.} \end{cases}$$

Applying Bayes’ theorem we see that

$$\mathbb{P}(\Theta = \theta | Y = (15, 17)) = \begin{cases} 1 & \theta = 16 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, $\mathbb{P}(\theta \in C | Y = (15, 17)) = 1$. There is nothing wrong
with saying that $\{16\}$ is a 75 percent confidence interval. But it is not
a probability statement about $\theta$. ■
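
The frequentist coverage claim and the conditional certainty can both be seen in a small simulation. This is a sketch with hypothetical names; the value 16 for the parameter is just the one used in the example.

```python
import random

def confidence_set(y1, y2):
    """The single point in C: y1 - 1 if y1 == y2, else the midpoint."""
    return y1 - 1 if y1 == y2 else (y1 + y2) / 2

def coverage(theta=16, trials=40000, seed=0):
    """Overall coverage of C, and coverage given that Y1 != Y2."""
    rng = random.Random(seed)
    hits = hits_when_different = n_different = 0
    for _ in range(trials):
        y1 = theta + rng.choice((-1, 1))
        y2 = theta + rng.choice((-1, 1))
        hit = confidence_set(y1, y2) == theta
        hits += hit
        if y1 != y2:
            n_different += 1
            hits_when_different += hit
    return hits / trials, hits_when_different / n_different
```

The first returned value should be near 0.75, the unconditional coverage, while the second is exactly 1: whenever the two observations differ, the set always contains the parameter.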


11.9 Example. This is a simplified version of the example in Robins and
Ritov (1997). The data consist of $n$ IID triples

$$(X_1, R_1, Y_1), \ldots, (X_n, R_n, Y_n).$$

Let $B$ be a finite but very large number, like $B = 100^{100}$. Any
realistic sample size $n$ will be small compared to $B$. Let

$$\theta = (\theta_1, \ldots, \theta_B)$$

be a vector of unknown parameters such that $0 \le \theta_j \le 1$ for
$1 \le j \le B$. Let

$$\xi = (\xi_1, \ldots, \xi_B)$$

be a vector of known numbers such that

$$0 < \delta \le \xi_j \le 1 - \delta < 1, \quad 1 \le j \le B,$$

where $\delta$ is some small, positive number. Each data point
$(X_i, R_i, Y_i)$ is drawn in the following way:

1. Draw $X_i$ uniformly from $\{1, \ldots, B\}$.

2. Draw $R_i \sim \text{Bernoulli}(\xi_{X_i})$.

3. If $R_i = 1$, then draw $Y_i \sim \text{Bernoulli}(\theta_{X_i})$. If
$R_i = 0$, do not draw $Y_i$.
The model may seem a little artificial but, in fact, it is a caricature of
some real missing data problems in which some data points are not observed.
In this example, $R_i = 0$ can be thought of as meaning “missing.” Our goal
is to estimate

$$\psi = \mathbb{P}(Y_i = 1).$$

Note that

$$\psi = \mathbb{P}(Y_i = 1) = \sum_{j=1}^B \mathbb{P}(Y_i = 1 | X = j)\,\mathbb{P}(X = j) = \frac{1}{B} \sum_{j=1}^B \theta_j \equiv g(\theta)$$

so $\psi = g(\theta)$ is a function of $\theta$.

Let us consider a Bayesian analysis first. The likelihood of a single
observation is

$$f(X_i, R_i, Y_i) = f(X_i) f(R_i | X_i) f(Y_i | X_i)^{R_i}.$$

The last term is raised to the power $R_i$ since, if $R_i = 0$, then $Y_i$ is
not observed and hence that term drops out of the likelihood. Since
$f(X_i) = 1/B$ and $Y_i$ and $R_i$ are Bernoulli,

$$f(X_i) f(R_i | X_i) f(Y_i | X_i)^{R_i} = \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.$$

Thus, the likelihood function is

$$\mathcal{L}(\theta) = \prod_{i=1}^n f(X_i) f(R_i | X_i) f(Y_i | X_i)^{R_i}$$

$$= \prod_{i=1}^n \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}$$

$$\propto \prod_{i=1}^n \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.$$

We have dropped all the terms involving $B$ and the $\xi_j$’s since these are
known constants, not parameters. The log-likelihood is

$$\ell(\theta) = \sum_{i=1}^n \Big[ Y_i R_i \log \theta_{X_i} + (1 - Y_i) R_i \log(1 - \theta_{X_i}) \Big] = \sum_{j=1}^B n_j \log \theta_j + \sum_{j=1}^B m_j \log(1 - \theta_j)$$

where

$$n_j = \#\{i : Y_i = 1, R_i = 1, X_i = j\}, \qquad m_j = \#\{i : Y_i = 0, R_i = 1, X_i = j\}.$$

Now, $n_j = m_j = 0$ for most $j$ since $B$ is so much larger than $n$. This
has several implications. First, the mle for most $\theta_j$ is not defined.
Second, for most $\theta_j$, the posterior distribution is equal to the prior
distribution, since those $\theta_j$ do not appear in the likelihood. Hence,
$f(\theta | \text{Data}) \approx f(\theta)$. It follows that
$f(\psi | \text{Data}) \approx f(\psi)$. In other words, the data provide
little information about $\psi$ in a Bayesian analysis.

Now we consider a frequentist solution. Define

$$\hat{\psi} = \frac{1}{n} \sum_{i=1}^n \frac{R_i Y_i}{\xi_{X_i}}. \tag{11.10}$$

We will now show that this estimator is unbiased and has small mean-squared
error. It can be shown (see Exercise 7) that

$$\mathbb{E}(\hat{\psi}) = \psi \quad \text{and} \quad \mathbb{V}(\hat{\psi}) \le \frac{1}{n\delta^2}. \tag{11.11}$$

Therefore, the MSE is of order $1/n$ which goes to 0 fairly quickly as we
collect more data, no matter how large $B$ is. The estimator defined in
(11.10) is called the Horvitz-Thompson estimator. It cannot be derived from a
Bayesian or likelihood point of view since it involves the terms
$\xi_{X_i}$. These terms drop out of the log-likelihood and hence will not
show up in any likelihood-based method including Bayesian estimators.

The moral of the story is this. Bayesian methods are tied to the likelihood
function. But in high dimensional (and nonparametric) problems, the
likelihood may not yield accurate inferences. ■
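
A small simulation illustrates how the inverse-probability-weighted estimator behaves. Everything specific here is an assumption made for the sketch: B is set to a million rather than an astronomically large number, and the parameter values and observation probabilities are chosen by a simple parity rule so that the true value of the target is 0.5 and the smallest observation probability is 0.2.

```python
import random

B = 10**6                      # stand-in for the very large B in the text

def theta(j):
    """Hypothetical success probabilities; their average over 1..B is 0.5."""
    return 0.9 if j % 2 == 0 else 0.1

def xi(j):
    """Hypothetical known observation probabilities, bounded below by 0.2."""
    return 0.8 if j % 2 == 0 else 0.2

def draw_triple(rng):
    x = rng.randint(1, B)                          # step 1
    r = 1 if rng.random() < xi(x) else 0           # step 2
    y = None                                       # Y is missing when r == 0
    if r == 1:
        y = 1 if rng.random() < theta(x) else 0    # step 3
    return x, r, y

def simulate(n, seed=0):
    rng = random.Random(seed)
    return [draw_triple(rng) for _ in range(n)]

def weighted_estimate(data):
    """Average of R_i * Y_i / xi_{X_i} over all n triples."""
    return sum(y / xi(x) for x, r, y in data if r == 1) / len(data)

def complete_case_mean(data):
    """Naive average of the observed Y values, ignoring the weights."""
    observed = [y for _, r, y in data if r == 1]
    return sum(observed) / len(observed)
```

In this setup the weighted estimate lands near 0.5, while the naive average of the observed outcomes is pulled toward 0.74 because outcomes with large success probability are observed more often.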


11.10 Example. Suppose that $f$ is a probability density function and that

$$f(x) = c\,g(x)$$

where $g(x) > 0$ is a known function and $c$ is unknown. In principle we can
compute $c$ since $\int f(x)\,dx = 1$ implies that
$c = 1/\int g(x)\,dx$. But in many cases we can’t do the integral
$\int g(x)\,dx$ since $g$ might be a complicated function and $x$ could be
high dimensional. Despite the fact that $c$ is not known, it is often
possible to draw a sample $X_1, \ldots, X_n$ from $f$; see Chapter 24. Can we
use the sample to estimate the normalizing constant $c$? Here is a
frequentist solution: Let $\hat{f}_n(x)$ be a consistent estimate of the
density $f$. Chapter 20 explains how to construct such an estimate. Choose
any point $x$ and note that $c = f(x)/g(x)$. Hence,
$\hat{c} = \hat{f}_n(x)/g(x)$ is a consistent estimate of $c$. Now let us try
to solve this problem from a Bayesian approach. Let $\pi(c)$ be a prior such
that $\pi(c) > 0$ for all $c > 0$. The likelihood function is

$$\mathcal{L}_n(c) = \prod_{i=1}^n f(X_i) = \prod_{i=1}^n c\,g(X_i) = c^n \prod_{i=1}^n g(X_i) \propto c^n.$$

Hence the posterior is proportional to $c^n \pi(c)$. The posterior does not
depend on $X_1, \ldots, X_n$, so we come to the startling conclusion that,
from the Bayesian point of view, there is no information in the data about
$c$. Moreover, the posterior mean is

$$\frac{\int_0^\infty c^{n+1} \pi(c)\,dc}{\int_0^\infty c^n \pi(c)\,dc}$$

which tends to infinity as $n$ increases. ■
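
The frequentist solution in this example can be tried on a case where the answer is known. In the sketch below, g is taken to be the unnormalized standard Normal density, so the true constant is 1/sqrt(2*pi); the Gaussian kernel density estimate and the simple bandwidth rule are assumptions made for illustration, not the construction of Chapter 20.

```python
import math
import random

def g(x):
    """Known unnormalized density; here the true c is 1 / sqrt(2 * pi)."""
    return math.exp(-x * x / 2)

def kernel_density(x, sample, h):
    """Gaussian kernel density estimate of f at the point x, bandwidth h."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return total / (len(sample) * h * math.sqrt(2 * math.pi))

def estimate_c(sample, x=0.0):
    """Estimate c as f_hat(x) / g(x) at a chosen point x."""
    h = len(sample) ** (-0.2)        # simple bandwidth choice (assumption)
    return kernel_density(x, sample, h) / g(x)
```

Given a sample from the standard Normal, `estimate_c(sample)` should be close to 0.399, and the estimate uses the data, in contrast with the Bayesian posterior derived above.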

These  last  two  examples  illustrate  an  important  point.  Bayesians  are  slaves 
to  the  likelihood  function.  When  the  likelihood  goes  awry,  so  will  Bayesian 
inference. 

What should we conclude from all this? The important thing is to understand
that frequentist and Bayesian methods are answering different questions. To
combine prior beliefs with data in a principled way, use Bayesian inference.
To construct procedures with guaranteed long run performance, such as
confidence intervals, use frequentist methods. Generally, Bayesian methods
run into problems when the parameter space is high dimensional. In
particular, 95 percent posterior intervals need not contain the true value 95
percent of the time (in the frequency sense).


11.10  Bibliographic  Remarks 

Some references on Bayesian inference include Carlin and Louis (1996), Gelman
et al. (1995), Lee (1997), Robert (1994), and Schervish (1995). See Cox
(1993), Diaconis and Freedman (1986), Freedman (1999), Barron et al. (1999),
Ghosal et al. (2000), Shen and Wasserman (2001), and Zhao (2000) for
discussions of some of the technicalities of nonparametric Bayesian
inference. The Robins-Ritov example is discussed in detail in Robins and
Ritov (1997) where it is cast more properly as a nonparametric problem.
Example 11.10 is due to Edward George (personal communication). See Berger
and Delampady (1987) and Kass and Raftery (1995) for a discussion of Bayesian
testing. See Kass and Wasserman (1996) for a discussion of noninformative
priors.


11.11  Appendix 


Proof  of  Theorem  11.5. 

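It can be shown that the effect of the prior diminishes as $n$ increases so
that
$f(\theta | X^n) \propto \mathcal{L}_n(\theta) f(\theta) \approx \mathcal{L}_n(\theta)$.
Hence, $\log f(\theta | X^n) \approx \ell(\theta)$. Now,
$\ell(\theta) \approx \ell(\hat{\theta}) + (\theta - \hat{\theta})\ell'(\hat{\theta}) + [(\theta - \hat{\theta})^2/2]\,\ell''(\hat{\theta}) = \ell(\hat{\theta}) + [(\theta - \hat{\theta})^2/2]\,\ell''(\hat{\theta})$
since $\ell'(\hat{\theta}) = 0$. Exponentiating, we get approximately that

$$f(\theta | X^n) \propto \exp\left\{ -\frac{1}{2} \frac{(\theta - \hat{\theta})^2}{\sigma_n^2} \right\}$$

where $\sigma_n^2 = -1/\ell''(\hat{\theta}_n)$. So the posterior of $\theta$
is approximately Normal with mean $\hat{\theta}$ and variance $\sigma_n^2$.
Let $\ell_i = \log f(X_i | \theta)$, then

$$\frac{1}{\sigma_n^2} = -\ell''(\hat{\theta}_n) = \sum_i -\ell_i''(\hat{\theta}_n) = n \left(\frac{1}{n}\right) \sum_i -\ell_i''(\hat{\theta}_n) \approx n\,\mathbb{E}_\theta\left[-\ell_i''(\hat{\theta}_n)\right] = n I(\hat{\theta}_n)$$

and hence $\sigma_n \approx se(\hat{\theta})$. ■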


11.12  Exercises 

1.  Verify  (11.7). 

2. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$.

(a) Simulate a data set (using $\mu = 5$) consisting of $n = 100$
observations.

(b) Take $f(\mu) = 1$ and find the posterior density. Plot the density.

(c) Simulate 1,000 draws from the posterior. Plot a histogram of the
simulated values and compare the histogram to the answer in (b).

(d) Let $\theta = e^\mu$. Find the posterior density for $\theta$
analytically and by simulation.

(e) Find a 95 percent posterior interval for $\mu$.

(f) Find a 95 percent confidence interval for $\theta$.
3. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Let
$f(\theta) \propto 1/\theta$. Find the posterior density.


4. Suppose that 50 people are given a placebo and 50 are given a new
treatment. 30 placebo patients show improvement while 40 treated patients
show improvement. Let $\tau = p_2 - p_1$ where $p_2$ is the probability of
improving under treatment and $p_1$ is the probability of improving under
placebo.

(a) Find the mle of $\tau$. Find the standard error and 90 percent confidence
interval using the delta method.

(b) Find the standard error and 90 percent confidence interval using the
parametric bootstrap.

(c) Use the prior $f(p_1, p_2) = 1$. Use simulation to find the posterior
mean and posterior 90 percent interval for $\tau$.

(d) Let

$$\psi = \log\left( \left(\frac{p_1}{1 - p_1}\right) \div \left(\frac{p_2}{1 - p_2}\right) \right)$$

be the log-odds ratio. Note that $\psi = 0$ if $p_1 = p_2$. Find the mle of
$\psi$. Use the delta method to find a 90 percent confidence interval for
$\psi$.

(e) Use simulation to find the posterior mean and posterior 90 percent
interval for $\psi$.

5.  Consider  the  Bernoulli(p)  observations 
0101000000 

Plot the posterior for $p$ using these priors: Beta(1/2,1/2), Beta(1,1),
Beta(10,10), Beta(100,100).

6. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$.

(a) Let $\lambda \sim \text{Gamma}(\alpha, \beta)$ be the prior. Show that
the posterior is also a Gamma. Find the posterior mean.

(b)  Find  the  Jeffreys’  prior.  Find  the  posterior. 

7.  In  Example  11.9,  verify  (11.11). 

8. Let $X \sim N(\mu, 1)$. Consider testing

$$H_0 : \mu = 0 \quad \text{versus} \quad H_1 : \mu \ne 0.$$

Take $\mathbb{P}(H_0) = \mathbb{P}(H_1) = 1/2$. Let the prior for $\mu$ under
$H_1$ be $\mu \sim N(0, b^2)$. Find an expression for
$\mathbb{P}(H_0 | X = x)$. Compare $\mathbb{P}(H_0 | X = x)$ to the p-value
of the Wald test. Do the comparison numerically for a variety of values of
$x$ and $b$. Now repeat the problem using a sample of size $n$. You will see
that the posterior probability of $H_0$ can be large even when the p-value is
small, especially when $n$ is large. This disagreement between Bayesian and
frequentist testing is called the Jeffreys-Lindley paradox.


12 

Statistical  Decision  Theory 


12.1  Preliminaries 

We  have  considered  several  point  estimators  such  as  the  maximum  likelihood 
estimator,  the  method  of  moments  estimator,  and  the  posterior  mean.  In  fact, 
there  are  many  other  ways  to  generate  estimators.  How  do  we  choose  among 
them?  The  answer  is  found  in  decision  theory  which  is  a  formal  theory  for 
comparing  statistical  procedures. 

Consider a parameter $\theta$ which lives in a parameter space $\Theta$. Let
$\hat{\theta}$ be an estimator of $\theta$. In the language of decision
theory, an estimator is sometimes called a decision rule and the possible
values of the decision rule are called actions.

We shall measure the discrepancy between $\theta$ and $\hat{\theta}$ using a
loss function $L(\theta, \hat{\theta})$. Formally, $L$ maps
$\Theta \times \Theta$ into $\mathbb{R}$. Here are some examples of loss
functions:


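$$
\begin{array}{ll}
L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 & \text{squared error loss,} \\
L(\theta, \hat{\theta}) = |\theta - \hat{\theta}| & \text{absolute error loss,} \\
L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|^p & L_p \text{ loss,} \\
L(\theta, \hat{\theta}) = 0 \text{ if } \theta = \hat{\theta} \text{ or } 1 \text{ if } \theta \ne \hat{\theta} & \text{zero-one loss,} \\
L(\theta, \hat{\theta}) = \int \log\left(\frac{f(x; \theta)}{f(x; \hat{\theta})}\right) f(x; \theta)\,dx & \text{Kullback-Leibler loss.}
\end{array}
$$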
Bear in mind in what follows that an estimator $\hat{\theta}$ is a function
of the data. To emphasize this point, sometimes we will write $\hat{\theta}$
as $\hat{\theta}(X)$. To assess an estimator, we evaluate the average loss or
risk.


12.1 Definition. The risk of an estimator $\hat{\theta}$ is

$$R(\theta, \hat{\theta}) = \mathbb{E}_\theta\left( L(\theta, \hat{\theta}) \right) = \int L(\theta, \hat{\theta}(x)) f(x; \theta)\,dx.$$

When the loss function is squared error, the risk is just the MSE (mean
squared error):

$$R(\theta, \hat{\theta}) = \mathbb{E}_\theta(\hat{\theta} - \theta)^2 = \text{MSE} = \mathbb{V}_\theta(\hat{\theta}) + \text{bias}_\theta^2(\hat{\theta}).$$

In  the  rest  of  the  chapter,  if  we  do  not  state  what  loss  function  we  are  using, 
assume  the  loss  function  is  squared  error. 

12.2  Comparing  Risk  Functions 

To  compare  two  estimators  we  can  compare  their  risk  functions.  However,  this 
does  not  provide  a  clear  answer  as  to  which  estimator  is  better.  Consider  the 
following  examples. 

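12.2 Example. Let $X \sim N(\theta, 1)$ and assume we are using squared error
loss. Consider two estimators: $\hat{\theta}_1 = X$ and
$\hat{\theta}_2 = 3$. The risk functions are
$R(\theta, \hat{\theta}_1) = \mathbb{E}_\theta(X - \theta)^2 = 1$ and
$R(\theta, \hat{\theta}_2) = \mathbb{E}_\theta(3 - \theta)^2 = (3 - \theta)^2$.
If $2 < \theta < 4$ then
$R(\theta, \hat{\theta}_2) < R(\theta, \hat{\theta}_1)$, otherwise,
$R(\theta, \hat{\theta}_1) \le R(\theta, \hat{\theta}_2)$. Neither estimator
uniformly dominates the other; see Figure 12.1. ■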

FIGURE 12.1. Comparing two risk functions. Neither risk function dominates
the other at all values of $\theta$.

12.3 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Consider
squared error loss and let $\hat{p}_1 = \bar{X}$. Since this has 0 bias, we
have that

$$R(p, \hat{p}_1) = \mathbb{V}(\bar{X}) = \frac{p(1 - p)}{n}.$$

Another estimator is

$$\hat{p}_2 = \frac{Y + \alpha}{\alpha + \beta + n}$$

where $Y = \sum_{i=1}^n X_i$ and $\alpha$ and $\beta$ are positive constants.
This is the posterior mean using a Beta($\alpha$, $\beta$) prior. Now,

$$R(p, \hat{p}_2) = \mathbb{V}_p(\hat{p}_2) + (\text{bias}_p(\hat{p}_2))^2$$

$$= \mathbb{V}_p\left(\frac{Y + \alpha}{\alpha + \beta + n}\right) + \left(\mathbb{E}_p\left(\frac{Y + \alpha}{\alpha + \beta + n}\right) - p\right)^2$$

$$= \frac{np(1 - p)}{(\alpha + \beta + n)^2} + \left(\frac{np + \alpha}{\alpha + \beta + n} - p\right)^2.$$

Let $\alpha = \beta = \sqrt{n/4}$. (In Example 12.12 we will explain this
choice.) The resulting estimator is

$$\hat{p}_2 = \frac{Y + \sqrt{n/4}}{n + \sqrt{n}}$$

and the risk function is

$$R(p, \hat{p}_2) = \frac{n}{4(n + \sqrt{n})^2}.$$

The risk functions are plotted in Figure 12.2. As we can see, neither
estimator uniformly dominates the other. ■


These  examples  highlight  the  need  to  be  able  to  compare  risk  functions. 
To  do  so,  we  need  a  one-number  summary  of  the  risk  function.  Two  such 
summaries  are  the  maximum  risk  and  the  Bayes  risk. 


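12.4 Definition. The maximum risk is

$$\bar{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta}) \tag{12.1}$$

and the Bayes risk is

$$r(f, \hat{\theta}) = \int R(\theta, \hat{\theta}) f(\theta)\,d\theta \tag{12.2}$$

where $f(\theta)$ is a prior for $\theta$.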
FIGURE 12.2. Risk functions for $\hat{p}_1$ and $\hat{p}_2$ in Example 12.3.
The solid curve is $R(\hat{p}_1)$. The dotted line is $R(\hat{p}_2)$.


12.5 Example. Consider again the two estimators in Example 12.3. We have

$$\bar{R}(\hat{p}_1) = \max_{0 \le p \le 1} \frac{p(1 - p)}{n} = \frac{1}{4n}$$

and

$$\bar{R}(\hat{p}_2) = \max_p \frac{n}{4(n + \sqrt{n})^2} = \frac{n}{4(n + \sqrt{n})^2}.$$

Based on maximum risk, $\hat{p}_2$ is a better estimator since
$\bar{R}(\hat{p}_2) < \bar{R}(\hat{p}_1)$. However, when $n$ is large,
$\hat{p}_1$ has smaller risk except for a small region in the parameter space
near $p = 1/2$. Thus, many people prefer $\hat{p}_1$ to $\hat{p}_2$. This
illustrates that one-number summaries like maximum risk are imperfect. Now
consider the Bayes risk. For illustration, let us take $f(p) = 1$. Then

$$r(f, \hat{p}_1) = \int R(p, \hat{p}_1)\,dp = \int \frac{p(1 - p)}{n}\,dp = \frac{1}{6n}$$

and

$$r(f, \hat{p}_2) = \int R(p, \hat{p}_2)\,dp = \frac{n}{4(n + \sqrt{n})^2}.$$

For $n \ge 20$, $r(f, \hat{p}_2) > r(f, \hat{p}_1)$ which suggests that
$\hat{p}_1$ is a better estimator. This might seem intuitively reasonable but
this answer depends on the choice of prior. The advantage of using maximum
risk, despite its problems, is that it does not require one to choose a
prior. ■
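
The two summaries in this example can be reproduced numerically from the risk functions of Example 12.3. The sketch below uses a grid for the maximum and a midpoint rule for the integral under the flat prior; the function names and grid sizes are assumptions for illustration.

```python
import math

def risk_p1(p, n):
    """Risk of the sample mean under squared error loss."""
    return p * (1 - p) / n

def risk_p2(p, n):
    """Risk of the Beta(sqrt(n/4), sqrt(n/4)) posterior mean; constant in p."""
    return n / (4 * (n + math.sqrt(n)) ** 2)

def max_risk(risk, n, grid=1001):
    """Maximum of the risk function over an evenly spaced grid on [0, 1]."""
    return max(risk(i / (grid - 1), n) for i in range(grid))

def bayes_risk_flat(risk, n, grid=1000):
    """Midpoint-rule integral of the risk over (0, 1) under f(p) = 1."""
    return sum(risk((i + 0.5) / grid, n) for i in range(grid)) / grid
```

Evaluating `bayes_risk_flat` for both estimators at n = 19 and n = 20 shows the ordering of the Bayes risks switching between those two sample sizes, while `max_risk` favors the second estimator at every n.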


These two summaries of the risk function suggest two different methods for
devising estimators: choosing $\hat{\theta}$ to minimize the maximum risk
leads to minimax estimators; choosing $\hat{\theta}$ to minimize the Bayes
risk leads to Bayes estimators.

12.6 Definition. A decision rule that minimizes the Bayes risk is called a
Bayes rule. Formally, $\hat{\theta}$ is a Bayes rule with respect to the
prior $f$ if

$$r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta}) \tag{12.3}$$

where the infimum is over all estimators $\tilde{\theta}$. An estimator that
minimizes the maximum risk is called a minimax rule. Formally,
$\hat{\theta}$ is minimax if

$$\sup_\theta R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta}) \tag{12.4}$$

where the infimum is over all estimators $\tilde{\theta}$.

12.3  Bayes  Estimators 


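Let $f$ be a prior. From Bayes’ theorem, the posterior density is

$$f(\theta | x) = \frac{f(x | \theta) f(\theta)}{m(x)} = \frac{f(x | \theta) f(\theta)}{\int f(x | \theta) f(\theta)\,d\theta} \tag{12.5}$$

where $m(x) = \int f(x, \theta)\,d\theta = \int f(x | \theta) f(\theta)\,d\theta$
is the marginal distribution of $X$. Define the posterior risk of an
estimator $\hat{\theta}(x)$ by

$$r(\hat{\theta} | x) = \int L(\theta, \hat{\theta}(x)) f(\theta | x)\,d\theta. \tag{12.6}$$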

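12.7 Theorem. The Bayes risk $r(f, \hat{\theta})$ satisfies

$$r(f, \hat{\theta}) = \int r(\hat{\theta} | x)\, m(x)\,dx.$$

Let $\hat{\theta}(x)$ be the value of $\theta$ that minimizes
$r(\hat{\theta} | x)$. Then $\hat{\theta}$ is the Bayes estimator.

Proof. We can rewrite the Bayes risk as follows:

$$r(f, \hat{\theta}) = \int R(\theta, \hat{\theta}) f(\theta)\,d\theta = \int \left( \int L(\theta, \hat{\theta}(x)) f(x | \theta)\,dx \right) f(\theta)\,d\theta$$

$$= \int\!\!\int L(\theta, \hat{\theta}(x)) f(x, \theta)\,dx\,d\theta = \int\!\!\int L(\theta, \hat{\theta}(x)) f(\theta | x)\, m(x)\,dx\,d\theta$$

$$= \int \left( \int L(\theta, \hat{\theta}(x)) f(\theta | x)\,d\theta \right) m(x)\,dx = \int r(\hat{\theta} | x)\, m(x)\,dx.$$

If we choose $\hat{\theta}(x)$ to be the value of $\theta$ that minimizes
$r(\hat{\theta} | x)$ then we will minimize the integrand at every $x$ and
thus minimize the integral $\int r(\hat{\theta} | x)\, m(x)\,dx$. ■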


Now  we  can  find  an  explicit  formula  for  the  Bayes  estimator  for  some  specific 
loss  functions. 

12.8 Theorem. If $L(\theta,\hat\theta) = (\theta - \hat\theta)^2$ then the Bayes estimator is

$$\hat\theta(x) = \int \theta\,f(\theta|x)\,d\theta = E(\theta\,|\,X = x). \tag{12.7}$$

If $L(\theta,\hat\theta) = |\theta - \hat\theta|$ then the Bayes estimator is the median of the posterior $f(\theta|x)$. If $L(\theta,\hat\theta)$ is zero-one loss, then the Bayes estimator is the mode of the posterior $f(\theta|x)$.


Proof. We will prove the theorem for squared error loss. The Bayes rule $\hat\theta(x)$ minimizes $r(\hat\theta|x) = \int (\theta - \hat\theta(x))^2 f(\theta|x)\,d\theta$. Taking the derivative of $r(\hat\theta|x)$ with respect to $\hat\theta(x)$ and setting it equal to 0 yields the equation $2\int (\theta - \hat\theta(x))\,f(\theta|x)\,d\theta = 0$. Solving for $\hat\theta(x)$ we get (12.7). ■
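The three cases of Theorem 12.8 can be checked numerically on a discretized posterior. The following is a minimal sketch, not part of the original text: the grid, the Beta(3, 9)-shaped posterior, and the helper names `bayes_estimates` and `posterior_risk` are all illustrative assumptions.

```python
import numpy as np

def bayes_estimates(theta, post):
    """Return (mean, median, mode) of a posterior tabulated on a grid.

    theta: increasing 1-D grid of parameter values.
    post:  nonnegative weights proportional to f(theta | x) on that grid.
    """
    w = post / post.sum()                     # normalize to a discrete posterior
    mean = float(np.sum(theta * w))           # Bayes rule under squared error loss
    cdf = np.cumsum(w)
    median = float(theta[np.searchsorted(cdf, 0.5)])  # under absolute error loss
    mode = float(theta[np.argmax(w)])         # under zero-one loss
    return mean, median, mode

def posterior_risk(theta, post, guess, loss):
    """Discrete version of (12.6): posterior expectation of loss(theta, guess)."""
    w = post / post.sum()
    return float(np.sum(loss(theta, guess) * w))

squared = lambda t, a: (t - a) ** 2
absolute = lambda t, a: np.abs(t - a)

# Hypothetical posterior: proportional to a Beta(3, 9) density for a Bernoulli p.
grid = np.linspace(0.0005, 0.9995, 1000)
post = grid**2 * (1 - grid) ** 8
mean, median, mode = bayes_estimates(grid, post)
```

On this grid the posterior mean gives the smallest posterior risk under squared error, and the posterior median gives the smallest posterior risk under absolute error, which is what the theorem predicts.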


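12.9 Example. Let $X_1,\dots,X_n \sim N(\mu,\sigma^2)$ where $\sigma^2$ is known. Suppose we use a $N(a,b^2)$ prior for $\mu$. The Bayes estimator with respect to squared error loss is the posterior mean, which is

$$\hat\theta(X_1,\dots,X_n) = \frac{b^2}{b^2 + \frac{\sigma^2}{n}}\,\bar X + \frac{\frac{\sigma^2}{n}}{b^2 + \frac{\sigma^2}{n}}\,a. \qquad ■$$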


12.4  Minimax  Rules 


Finding  minimax  rules  is  complicated  and  we  cannot  attempt  a  complete 
coverage  of  that  theory  here  but  we  will  mention  a  few  key  results.  The  main 
message  to  take  away  from  this  section  is:  Bayes  estimators  with  a  constant 
risk  function  are  minimax. 

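12.10 Theorem. Let $\hat\theta^f$ be the Bayes rule for some prior $f$:

$$r(f,\hat\theta^f) = \inf_{\hat\theta} r(f,\hat\theta). \tag{12.8}$$

Suppose that

$$R(\theta,\hat\theta^f) \le r(f,\hat\theta^f) \quad\text{for all } \theta. \tag{12.9}$$

Then $\hat\theta^f$ is minimax and $f$ is called a least favorable prior.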



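Proof. Suppose that $\hat\theta^f$ is not minimax. Then there is another rule $\hat\theta_0$ such that $\sup_\theta R(\theta,\hat\theta_0) < \sup_\theta R(\theta,\hat\theta^f)$. Since the average of a function is always less than or equal to its maximum, we have that $r(f,\hat\theta_0) \le \sup_\theta R(\theta,\hat\theta_0)$. Hence,

$$r(f,\hat\theta_0) \le \sup_\theta R(\theta,\hat\theta_0) < \sup_\theta R(\theta,\hat\theta^f) \le r(f,\hat\theta^f)$$

which contradicts (12.8). ■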


12.11 Theorem. Suppose that $\hat\theta$ is the Bayes rule with respect to some prior $f$. Suppose further that $\hat\theta$ has constant risk: $R(\theta,\hat\theta) = c$ for some $c$. Then $\hat\theta$ is minimax.


Proof. The Bayes risk is $r(f,\hat\theta) = \int R(\theta,\hat\theta)\,f(\theta)\,d\theta = c$ and hence $R(\theta,\hat\theta) \le r(f,\hat\theta)$ for all $\theta$. Now apply the previous theorem. ■

12.12 Example. Consider the Bernoulli model with squared error loss. In example 12.3 we showed that the estimator

$$\hat p(X^n) = \frac{\sum_{i=1}^n X_i + \sqrt{n/4}}{n + \sqrt{n}}$$

has a constant risk function. This estimator is the posterior mean, and hence the Bayes rule, for the prior Beta$(\alpha,\beta)$ with $\alpha = \beta = \sqrt{n/4}$. Hence, by the previous theorem, this estimator is minimax. ■

12.13 Example. Consider again the Bernoulli but with loss function

$$L(p,\hat p) = \frac{(p - \hat p)^2}{p(1-p)}.$$

Let

$$\hat p(X^n) = \hat p = \frac{\sum_{i=1}^n X_i}{n}.$$

The risk is

$$R(p,\hat p) = E\left(\frac{(\hat p - p)^2}{p(1-p)}\right) = \frac{1}{p(1-p)}\left(\frac{p(1-p)}{n}\right) = \frac{1}{n}$$

which, as a function of $p$, is constant. It can be shown that, for this loss function, $\hat p(X^n)$ is the Bayes estimator under the prior $f(p) = 1$. Hence, $\hat p$ is minimax. ■
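The constant-risk claims in Examples 12.12 and 12.13 can be verified exactly, because the risk of an estimator based on $S = \sum_i X_i \sim \text{Binomial}(n,p)$ is a finite sum. The sketch below is illustrative only: the function `exact_risk`, the sample size $n = 16$, and the grid of $p$ values are assumptions, not part of the original text.

```python
from math import comb, sqrt

def exact_risk(n, p, estimator, loss):
    """Exact risk E_p[loss(p, estimator(S))] where S ~ Binomial(n, p)."""
    return sum(comb(n, s) * p**s * (1 - p) ** (n - s) * loss(p, estimator(s))
               for s in range(n + 1))

squared = lambda p, a: (a - p) ** 2                      # loss of Example 12.12
weighted = lambda p, a: (a - p) ** 2 / (p * (1 - p))     # loss of Example 12.13

n = 16
mle = lambda s: s / n
minimax = lambda s: (s + sqrt(n / 4)) / (n + sqrt(n))

ps = [0.05, 0.25, 0.5, 0.75, 0.95]
mle_risks = [exact_risk(n, p, mle, squared) for p in ps]          # p(1-p)/n, varies with p
minimax_risks = [exact_risk(n, p, minimax, squared) for p in ps]  # n / (4 (n + sqrt(n))^2)
weighted_risks = [exact_risk(n, p, mle, weighted) for p in ps]    # 1/n for every p
```

With $n = 16$ the minimax estimator has risk $16/(4\cdot 20^2) = 0.01$ at every $p$, while the risk of the sample proportion under squared error ranges from below $0.01$ near the endpoints to $0.0156$ at $p = 1/2$.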


A natural question to ask is: what is the minimax estimator for a Normal model?

12.14 Theorem. Let $X_1,\dots,X_n \sim N(\theta,1)$ and let $\hat\theta = \bar X$. Then $\hat\theta$ is minimax with respect to any well-behaved loss function.¹ It is the only estimator with this property.

FIGURE 12.3. Risk function for constrained Normal with m=.5. The two short dashed lines show the least favorable prior which puts its mass at two points.

If  the  parameter  space  is  restricted,  then  the  theorem  above  does  not  apply 
as  the  next  example  shows. 

12.15 Example. Suppose that $X \sim N(\theta,1)$ and that $\theta$ is known to lie in the interval $[-m,m]$ where $0 < m < 1$. The unique, minimax estimator under squared error loss is

$$\hat\theta(X) = m\tanh(mX)$$

where $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$. It can be shown that this is the Bayes rule with respect to the prior that puts mass 1/2 at $m$ and mass 1/2 at $-m$. Moreover, it can be shown that the risk is not constant but it does satisfy $R(\theta,\hat\theta) \le r(f,\hat\theta)$ for all $\theta$; see Figure 12.3. Hence, Theorem 12.10 implies that $\hat\theta$ is minimax. ■


1  “Well-behaved”  means  that  the  level  sets  must  be  convex  and  symmetric  about  the  origin. 
The  result  holds  up  to  sets  of  measure  0. 


12.5  Maximum  Likelihood,  Minimax,  and  Bayes


For parametric models that satisfy weak regularity conditions, the maximum likelihood estimator is approximately minimax. Consider squared error loss, for which the risk is squared bias plus variance. In parametric models with large samples, it can be shown that the variance term dominates the bias so the risk of the MLE $\hat\theta$ roughly equals the variance:²

$$R(\theta,\hat\theta) = V_\theta(\hat\theta) + \text{bias}^2 \approx V_\theta(\hat\theta).$$

As we saw in Chapter 9, the variance of the MLE is approximately

$$V_\theta(\hat\theta) \approx \frac{1}{n I(\theta)}$$

where $I(\theta)$ is the Fisher information. Hence,

$$n\,R(\theta,\hat\theta) \approx \frac{1}{I(\theta)}. \tag{12.10}$$

For any other estimator $\tilde\theta$, it can be shown that for large $n$, $R(\theta,\tilde\theta) \ge R(\theta,\hat\theta)$. More precisely,

$$\lim_{\epsilon\to 0}\,\limsup_{n\to\infty}\,\sup_{|\theta-\theta'|<\epsilon} n\,R(\theta',\tilde\theta) \ge \frac{1}{I(\theta)}. \tag{12.11}$$

This says that, in a local, large sample sense, the MLE is minimax. It can also be shown that the MLE is approximately the Bayes rule.

In summary:

In most parametric models, with large samples, the MLE is approximately minimax and Bayes.

There  is  a  caveat:  these  results  break  down  when  the  number  of  parameters 
is  large  as  the  next  example  shows. 

12.16 Example (Many Normal means). Let $Y_i \sim N(\theta_i,\sigma^2/n)$, $i = 1,\dots,n$. Let $Y = (Y_1,\dots,Y_n)$ denote the data and let $\theta = (\theta_1,\dots,\theta_n)$ denote the unknown parameters. Assume that

$$\theta \in \left\{(\theta_1,\dots,\theta_n) : \sum_{i=1}^n \theta_i^2 \le c^2\right\}$$

for some $c > 0$. In this model, there are as many parameters as observations.³ The MLE is $\hat\theta = Y = (Y_1,\dots,Y_n)$. Under the loss function $L(\theta,\hat\theta) = \sum_{i=1}^n (\hat\theta_i - \theta_i)^2$, the risk of the MLE is $R(\theta,\hat\theta) = \sigma^2$. It can be shown that the minimax risk is approximately $\sigma^2 c^2/(\sigma^2 + c^2)$ and one can find an estimator $\tilde\theta$ that achieves this risk. Since $\sigma^2 c^2/(\sigma^2 + c^2) < \sigma^2$, we see that $\tilde\theta$ has smaller risk than the MLE. In practice, the difference between the risks can be substantial. This shows that maximum likelihood is not an optimal estimator in high dimensional problems. ■


²Typically, the squared bias is of order $O(n^{-2})$ while the variance is of order $O(n^{-1})$.


12.6  Admissibility 

Minimax  estimators  and  Bayes  estimators  are  “good  estimators”  in  the  sense 
that  they  have  small  risk.  It  is  also  useful  to  characterize  bad  estimators. 


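12.17 Definition. An estimator $\hat\theta$ is inadmissible if there exists another rule $\hat\theta'$ such that

$$R(\theta,\hat\theta') \le R(\theta,\hat\theta) \quad\text{for all } \theta \text{ and}$$
$$R(\theta,\hat\theta') < R(\theta,\hat\theta) \quad\text{for at least one } \theta.$$

Otherwise, $\hat\theta$ is admissible.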


12.18 Example. Let $X \sim N(\theta,1)$ and consider estimating $\theta$ with squared error loss. Let $\hat\theta(X) = 3$. We will show that $\hat\theta$ is admissible. Suppose not. Then there exists a different rule $\hat\theta'$ with smaller risk. In particular, $R(3,\hat\theta') \le R(3,\hat\theta) = 0$. Hence, $0 = R(3,\hat\theta') = \int (\hat\theta'(x) - 3)^2 f(x;3)\,dx$. Thus, $\hat\theta'(x) = 3$. So there is no rule that beats $\hat\theta$. Even though $\hat\theta$ is admissible it is clearly a bad decision rule. ■

12.19 Theorem (Bayes Rules Are Admissible). Suppose that $\Theta \subset \mathbb{R}$ and that $R(\theta,\hat\theta)$ is a continuous function of $\theta$ for every $\hat\theta$. Let $f$ be a prior density with full support, meaning that, for every $\theta$ and every $\epsilon > 0$, $\int_{\theta-\epsilon}^{\theta+\epsilon} f(\theta)\,d\theta > 0$. Let $\hat\theta^f$ be the Bayes rule. If the Bayes risk is finite then $\hat\theta^f$ is admissible.

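Proof. Suppose $\hat\theta^f$ is inadmissible. Then there exists a better rule $\hat\theta$ such that $R(\theta,\hat\theta) \le R(\theta,\hat\theta^f)$ for all $\theta$ and $R(\theta_0,\hat\theta) < R(\theta_0,\hat\theta^f)$ for some $\theta_0$. Let $\nu = R(\theta_0,\hat\theta^f) - R(\theta_0,\hat\theta) > 0$. Since $R$ is continuous, there is an $\epsilon > 0$ such that $R(\theta,\hat\theta^f) - R(\theta,\hat\theta) > \nu/2$ for all $\theta \in (\theta_0 - \epsilon, \theta_0 + \epsilon)$. Now,

$$\begin{aligned}
r(f,\hat\theta^f) - r(f,\hat\theta) &= \int R(\theta,\hat\theta^f)\,f(\theta)\,d\theta - \int R(\theta,\hat\theta)\,f(\theta)\,d\theta \\
&= \int \left[R(\theta,\hat\theta^f) - R(\theta,\hat\theta)\right] f(\theta)\,d\theta \\
&\ge \int_{\theta_0-\epsilon}^{\theta_0+\epsilon} \left[R(\theta,\hat\theta^f) - R(\theta,\hat\theta)\right] f(\theta)\,d\theta \\
&\ge \frac{\nu}{2}\int_{\theta_0-\epsilon}^{\theta_0+\epsilon} f(\theta)\,d\theta \\
&> 0.
\end{aligned}$$

Hence, $r(f,\hat\theta^f) > r(f,\hat\theta)$. This implies that $\hat\theta^f$ does not minimize $r(f,\hat\theta)$ which contradicts the fact that $\hat\theta^f$ is the Bayes rule. ■


³The many Normal means problem is more general than it looks. Many nonparametric estimation problems are mathematically equivalent to this model.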


12.20 Theorem. Let $X_1,\dots,X_n \sim N(\mu,\sigma^2)$. Under squared error loss, $\bar X$ is admissible.


The proof of the last theorem is quite technical and is omitted but the idea is as follows: The posterior mean is admissible for any strictly positive prior. Take the prior to be $N(a,b^2)$. When $b^2$ is very large, the posterior mean is approximately equal to $\bar X$.

How  are  minimaxity  and  admissibility  linked?  In  general,  a  rule  may  be  one, 
both,  or  neither.  But  here  are  some  facts  linking  admissibility  and  minimaxity. 

12.21 Theorem. Suppose that $\hat\theta$ has constant risk and is admissible. Then it is minimax.

Proof. The risk is $R(\theta,\hat\theta) = c$ for some $c$. If $\hat\theta$ were not minimax then there exists a rule $\hat\theta'$ such that

$$R(\theta,\hat\theta') \le \sup_\theta R(\theta,\hat\theta') < \sup_\theta R(\theta,\hat\theta) = c.$$

This would imply that $\hat\theta$ is inadmissible. ■

Now  we  can  prove  a  restricted  version  of  Theorem  12.14  for  squared  error 
loss. 

12.22 Theorem. Let $X_1,\dots,X_n \sim N(\theta,1)$. Then, under squared error loss, $\hat\theta = \bar X$ is minimax.

Proof. According to Theorem 12.20, $\hat\theta$ is admissible. The risk of $\hat\theta$ is $1/n$ which is constant. The result follows from Theorem 12.21. ■



12.  Statistical  Decision  Theory 


Although minimax rules are not guaranteed to be admissible they are “close to admissible.” Say that $\hat\theta$ is strongly inadmissible if there exists a rule $\hat\theta'$ and an $\epsilon > 0$ such that $R(\theta,\hat\theta') \le R(\theta,\hat\theta) - \epsilon$ for all $\theta$.

12.23 Theorem. If $\hat\theta$ is minimax, then it is not strongly inadmissible.


12.7  Stein’s  Paradox 

Suppose that $X \sim N(\theta,1)$ and consider estimating $\theta$ with squared error loss. From the previous section we know that $\hat\theta(X) = X$ is admissible. Now consider estimating two, unrelated quantities $\theta = (\theta_1,\theta_2)$ and suppose that $X_1 \sim N(\theta_1,1)$ and $X_2 \sim N(\theta_2,1)$ independently, with loss $L(\theta,\hat\theta) = \sum_{j=1}^2 (\theta_j - \hat\theta_j)^2$. Not surprisingly, $\hat\theta(X) = X$ is again admissible where $X = (X_1,X_2)$. Now consider the generalization to $k$ normal means. Let $\theta = (\theta_1,\dots,\theta_k)$, $X = (X_1,\dots,X_k)$ with $X_i \sim N(\theta_i,1)$ (independent) and loss $L(\theta,\hat\theta) = \sum_{j=1}^k (\theta_j - \hat\theta_j)^2$. Stein astounded everyone when he proved that, if $k \ge 3$, then $\hat\theta(X) = X$ is inadmissible. It can be shown that the James-Stein estimator $\hat\theta^S$ has smaller risk, where $\hat\theta^S = (\hat\theta^S_1,\dots,\hat\theta^S_k)$,

$$\hat\theta^S_i(X) = \left(1 - \frac{k-2}{\sum_{j} X_j^2}\right)^{+} X_i \tag{12.12}$$

and $(z)^+ = \max\{z,0\}$. This estimator shrinks the $X_i$'s towards 0. The message is that, when estimating many parameters, there is great value in shrinking the estimates. This observation plays an important role in modern nonparametric function estimation.
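The gain from shrinkage can be seen in a small Monte Carlo experiment. This is a minimal sketch under stated assumptions: it uses the positive-part form $(1 - (k-2)/\sum_j X_j^2)^+ X_i$ for the James-Stein estimator, and the function names, seed, and number of simulations are illustrative choices, not from the text.

```python
import numpy as np

def james_stein(x):
    """Positive-part James-Stein estimate for each row of x (row ~ N(theta, I_k))."""
    k = x.shape[-1]
    shrink = 1.0 - (k - 2) / np.sum(x**2, axis=-1, keepdims=True)
    return np.maximum(shrink, 0.0) * x

def simulated_risks(theta, n_sims=20000, seed=0):
    """Monte Carlo estimates of the risk of X and of the James-Stein estimator."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    x = theta + rng.standard_normal((n_sims, theta.size))
    risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
    risk_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
    return float(risk_mle), float(risk_js)

# k = 10 means, all equal to zero: the most favorable case for shrinkage.
risk_mle, risk_js = simulated_risks(np.zeros(10))
# Means far from zero: the gain is small but the estimator still does no worse.
risk_mle_far, risk_js_far = simulated_risks(5.0 * np.ones(10))
```

The risk of $\hat\theta(X) = X$ is $k$ regardless of $\theta$, so the interesting quantity is how far below $k$ the shrinkage estimator falls for a given $\theta$.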


12.8  Bibliographic  Remarks 

Aspects  of  decision  theory  can  be  found  in  Casella  and  Berger  (2002),  Berger 
(1985),  Ferguson  (1967),  and  Lehmann  and  Casella  (1998). 


12.9  Exercises 

1. In each of the following models, find the Bayes risk and the Bayes estimator, using squared error loss.

(a) $X \sim \text{Binomial}(n,p)$, $p \sim \text{Beta}(\alpha,\beta)$.

(b) $X \sim \text{Poisson}(\lambda)$, $\lambda \sim \text{Gamma}(\alpha,\beta)$.

(c) $X \sim N(\theta,\sigma^2)$ where $\sigma^2$ is known and $\theta \sim N(a,b^2)$.

2. Let $X_1,\dots,X_n \sim N(\theta,\sigma^2)$ and suppose we estimate $\theta$ with loss function $L(\theta,\hat\theta) = (\theta - \hat\theta)^2/\sigma^2$. Show that $\bar X$ is admissible and minimax.

3. Let $\Theta = \{\theta_1,\dots,\theta_k\}$ be a finite parameter space. Prove that the posterior mode is the Bayes estimator under zero-one loss.

4. (Casella and Berger (2002).) Let $X_1,\dots,X_n$ be a sample from a distribution with variance $\sigma^2$. Consider estimators of the form $bS^2$ where $S^2$ is the sample variance. Let the loss function for estimating $\sigma^2$ be

$$L(\sigma^2,\hat\sigma^2) = \frac{\hat\sigma^2}{\sigma^2} - 1 - \log\left(\frac{\hat\sigma^2}{\sigma^2}\right).$$

Find the optimal value of $b$ that minimizes the risk for all $\sigma^2$.

5. (Berliner (1983).) Let $X \sim \text{Binomial}(n,p)$ and suppose the loss function is

$$L(p,\hat p) = \left(1 - \frac{\hat p}{p}\right)^2$$

where $0 < p < 1$. Consider the estimator $\hat p(X) = 0$. This estimator falls outside the parameter space $(0,1)$ but we will allow this. Show that $\hat p(X) = 0$ is the unique, minimax rule.

6. (Computer Experiment.) Compare the risk of the MLE and the James-Stein estimator (12.12) by simulation. Try various values of $n$ and various vectors $\theta$. Summarize your results.


Linear  and  Logistic  Regression


Regression is a method for studying the relationship between a response variable $Y$ and a covariate $X$. The covariate is also called a predictor variable or a feature.¹ One way to summarize the relationship between $X$ and $Y$ is through the regression function

$$r(x) = E(Y|X = x) = \int y\,f(y|x)\,dy. \tag{13.1}$$

Our goal is to estimate the regression function $r(x)$ from data of the form

$$(Y_1,X_1),\dots,(Y_n,X_n) \sim F_{X,Y}.$$

In this Chapter, we take a parametric approach and assume that $r$ is linear. In Chapters 20 and 21 we discuss nonparametric regression.

13.1  Simple  Linear  Regression 

The simplest version of regression is when $X_i$ is simple (one-dimensional) and $r(x)$ is assumed to be linear:

$$r(x) = \beta_0 + \beta_1 x.$$


¹The term “regression” is due to Sir Francis Galton (1822-1911) who noticed that tall and short men tend to have sons with heights closer to the mean. He called this “regression towards the mean.”

FIGURE 13.1. Data on nearby stars. The solid line is the least squares line. (Horizontal axis: log light intensity (X).)

This model is called the simple linear regression model. We will make the further simplifying assumption that $V(\epsilon_i|X = x) = \sigma^2$ does not depend on $x$. We can thus write the linear regression model as follows.

13.1 Definition. The Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{13.2}$$

where $E(\epsilon_i|X_i) = 0$ and $V(\epsilon_i|X_i) = \sigma^2$.

13.2  Example.  Figure  13.1  shows  a  plot  of  log  surface  temperature  (Y)  versus 
log  light  intensity  (X)  for  some  nearby  stars.  Also  on  the  plot  is  an  estimated 
linear  regression  line  which  will  be  explained  shortly.  ■ 

The unknown parameters in the model are the intercept $\beta_0$ and the slope $\beta_1$ and the variance $\sigma^2$. Let $\hat\beta_0$ and $\hat\beta_1$ denote estimates of $\beta_0$ and $\beta_1$. The fitted line is

$$\hat r(x) = \hat\beta_0 + \hat\beta_1 x. \tag{13.3}$$

The predicted values or fitted values are $\hat Y_i = \hat r(X_i)$ and the residuals are defined to be

$$\hat\epsilon_i = Y_i - \hat Y_i = Y_i - \left(\hat\beta_0 + \hat\beta_1 X_i\right). \tag{13.4}$$

The residual sums of squares or RSS, which measures how well the line fits the data, is defined by $\text{RSS} = \sum_{i=1}^n \hat\epsilon_i^2$.


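13.3 Definition. The least squares estimates are the values $\hat\beta_0$ and $\hat\beta_1$ that minimize $\text{RSS} = \sum_{i=1}^n \hat\epsilon_i^2$.


13.4 Theorem. The least squares estimates are given by

$$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)}{\sum_{i=1}^n (X_i - \bar X_n)^2}, \tag{13.5}$$

$$\hat\beta_0 = \bar Y_n - \hat\beta_1 \bar X_n. \tag{13.6}$$

An unbiased estimate of $\sigma^2$ is

$$\hat\sigma^2 = \left(\frac{1}{n-2}\right)\sum_{i=1}^n \hat\epsilon_i^2. \tag{13.7}$$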

13.5 Example. Consider the star data from Example 13.2. The least squares estimates are $\hat\beta_0 = 3.58$ and $\hat\beta_1 = 0.166$. The fitted line $\hat r(x) = 3.58 + 0.166\,x$ is shown in Figure 13.1. ■

13.6 Example (The 2000 Presidential Election). Figure 13.2 shows the plot of votes for Buchanan (Y) versus votes for Bush (X) in Florida. The least squares estimates (omitting Palm Beach County) and the standard errors are

$$\hat\beta_0 = 66.0991 \qquad \widehat{se}(\hat\beta_0) = 17.2926$$
$$\hat\beta_1 = 0.0035 \qquad \widehat{se}(\hat\beta_1) = 0.0002.$$

The fitted line is

Buchanan = 66.0991 + 0.0035 Bush.

(We will see later how the standard errors were computed.) Figure 13.2 also shows the residuals. The inferences from linear regression are most accurate when the residuals behave like random normal numbers. Based on the residual plot, this is not the case in this example. If we repeat the analysis replacing votes with log(votes) we get

$$\hat\beta_0 = -2.3298 \qquad \widehat{se}(\hat\beta_0) = 0.3529$$
$$\hat\beta_1 = 0.7303 \qquad \widehat{se}(\hat\beta_1) = 0.0358.$$

FIGURE 13.2. Voting Data for Election 2000. See example 13.6. (Four panels, each with Bush votes on the horizontal axis: two on the original scale and two on the log scale.)

This gives the fit

log(Buchanan) = -2.3298 + 0.7303 log(Bush).

The residuals look much healthier. Later, we shall address the following question: how do we see if Palm Beach County has a statistically plausible outcome? ■

13.2  Least  Squares  and  Maximum  Likelihood 

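Suppose we add the assumption that $\epsilon_i|X_i \sim N(0,\sigma^2)$, that is,

$$Y_i|X_i \sim N(\mu_i,\sigma^2)$$

where $\mu_i = \beta_0 + \beta_1 X_i$. The likelihood function is

$$\begin{aligned}
\prod_{i=1}^n f(X_i,Y_i) &= \prod_{i=1}^n f_X(X_i)\,f_{Y|X}(Y_i|X_i) \\
&= \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y|X}(Y_i|X_i) \\
&= \mathcal{L}_1 \times \mathcal{L}_2
\end{aligned}$$

where $\mathcal{L}_1 = \prod_{i=1}^n f_X(X_i)$ and

$$\mathcal{L}_2 = \prod_{i=1}^n f_{Y|X}(Y_i|X_i). \tag{13.8}$$

The term $\mathcal{L}_1$ does not involve the parameters $\beta_0$ and $\beta_1$. We shall focus on the second term $\mathcal{L}_2$ which is called the conditional likelihood, given by

$$\mathcal{L}_2 \equiv \mathcal{L}(\beta_0,\beta_1,\sigma) = \prod_{i=1}^n f_{Y|X}(Y_i|X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \mu_i)^2\right\}.$$

The conditional log-likelihood is

$$\ell(\beta_0,\beta_1,\sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n \Bigl(Y_i - (\beta_0 + \beta_1 X_i)\Bigr)^2. \tag{13.9}$$

To find the MLE of $(\beta_0,\beta_1)$ we maximize $\ell(\beta_0,\beta_1,\sigma)$. From (13.9) we see that maximizing the likelihood is the same as minimizing the RSS $\sum_{i=1}^n \bigl(Y_i - (\beta_0 + \beta_1 X_i)\bigr)^2$. Therefore, we have shown the following: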


13.7 Theorem. Under the assumption of Normality, the least squares estimator is also the maximum likelihood estimator.


We can also maximize $\ell(\beta_0,\beta_1,\sigma)$ over $\sigma$, yielding the MLE

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \hat\epsilon_i^2. \tag{13.10}$$

This estimator is similar to, but not identical to, the unbiased estimator. Common practice is to use the unbiased estimator (13.7).

13.3  Properties  of  the  Least  Squares  Estimators

We now record the standard errors and limiting distribution of the least squares estimator. In regression problems, we usually focus on the properties of the estimators conditional on $X^n = (X_1,\dots,X_n)$. Thus, we state the means and variances as conditional means and variances.


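13.8 Theorem. Let $\hat\beta^T = (\hat\beta_0,\hat\beta_1)^T$ denote the least squares estimators. Then,

$$E(\hat\beta|X^n) = \begin{pmatrix}\beta_0 \\ \beta_1\end{pmatrix}$$

$$V(\hat\beta|X^n) = \frac{\sigma^2}{n s_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^n X_i^2 & -\bar X_n \\ -\bar X_n & 1\end{pmatrix} \tag{13.11}$$

where $s_X^2 = n^{-1}\sum_{i=1}^n (X_i - \bar X_n)^2$.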


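The estimated standard errors of $\hat\beta_0$ and $\hat\beta_1$ are obtained by taking the square roots of the corresponding diagonal terms of $V(\hat\beta|X^n)$ and inserting the estimate $\hat\sigma$ for $\sigma$. Thus,

$$\widehat{se}(\hat\beta_0) = \frac{\hat\sigma}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^n X_i^2}{n}} \tag{13.12}$$

$$\widehat{se}(\hat\beta_1) = \frac{\hat\sigma}{s_X\sqrt{n}}. \tag{13.13}$$

We should really write these as $\widehat{se}(\hat\beta_0|X^n)$ and $\widehat{se}(\hat\beta_1|X^n)$ but we will use the shorter notation $\widehat{se}(\hat\beta_0)$ and $\widehat{se}(\hat\beta_1)$.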
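The estimated standard errors of the intercept and slope can be computed in a few lines. The sketch below is illustrative: the function name `ls_standard_errors` and the toy data are assumptions, and $s_X$ is computed with divisor $n$ while $\hat\sigma$ uses divisor $n-2$.

```python
import numpy as np

def ls_standard_errors(x, y):
    """Return (se(beta0_hat), se(beta1_hat)) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar = x.mean()
    s_x = np.sqrt(np.mean((x - x_bar) ** 2))          # s_X with divisor n
    beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
    beta0 = y.mean() - beta1 * x_bar
    residuals = y - beta0 - beta1 * x
    sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))  # unbiased variance estimate
    se1 = sigma_hat / (s_x * np.sqrt(n))                 # slope
    se0 = se1 * np.sqrt(np.mean(x**2))                   # intercept
    return float(se0), float(se1)

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.3])
se0, se1 = ls_standard_errors(x, y)
```

These values agree with the square roots of the diagonal of $\hat\sigma^2 (X^T X)^{-1}$ when the design matrix has a column of ones and a column of $X_i$ values.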

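13.9 Theorem. Under appropriate conditions we have:

1. (Consistency): $\hat\beta_0 \xrightarrow{P} \beta_0$ and $\hat\beta_1 \xrightarrow{P} \beta_1$.

2. (Asymptotic Normality):

$$\frac{\hat\beta_0 - \beta_0}{\widehat{se}(\hat\beta_0)} \rightsquigarrow N(0,1) \quad\text{and}\quad \frac{\hat\beta_1 - \beta_1}{\widehat{se}(\hat\beta_1)} \rightsquigarrow N(0,1).$$

3. Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$ are

$$\hat\beta_0 \pm z_{\alpha/2}\,\widehat{se}(\hat\beta_0) \quad\text{and}\quad \hat\beta_1 \pm z_{\alpha/2}\,\widehat{se}(\hat\beta_1). \tag{13.14}$$

4. The Wald test² for testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \ne 0$ is: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1/\widehat{se}(\hat\beta_1)$.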

13.10 Example. For the election data, on the log scale, a 95 percent confidence interval for the slope is $.7303 \pm 2(.0358) = (.66, .80)$. The Wald statistic for testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \ne 0$ is $|W| = |.7303 - 0|/.0358 = 20.40$ with a p-value of $P(|Z| > 20.40) \approx 0$. This is strong evidence that the true slope is not 0. ■
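The arithmetic in Example 13.10 can be reproduced directly. This is a small sketch using only the reported slope and standard error; the multiplier 2 in the interval follows the example, and the two-sided Normal tail probability is computed with the standard library's complementary error function.

```python
from math import erfc, sqrt

beta1_hat = 0.7303   # slope on the log scale, from Example 13.6
se1 = 0.0358         # its estimated standard error

wald = (beta1_hat - 0.0) / se1
ci = (beta1_hat - 2 * se1, beta1_hat + 2 * se1)

# P(|Z| > |W|) for a standard Normal Z equals erfc(|W| / sqrt(2)).
p_value = erfc(abs(wald) / sqrt(2))
```

The interval endpoints are $0.6587$ and $0.8019$, which round to the $(.66, .80)$ quoted in the example, and the p-value is far smaller than any conventional significance level.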


13.4  Prediction 

Suppose we have estimated a regression model $\hat r(x) = \hat\beta_0 + \hat\beta_1 x$ from data $(X_1,Y_1),\dots,(X_n,Y_n)$. We observe the value $X = x_*$ of the covariate for a new subject and we want to predict their outcome $Y_*$. An estimate of $Y_*$ is

$$\hat Y_* = \hat\beta_0 + \hat\beta_1 x_*. \tag{13.15}$$

Using the formula for the variance of the sum of two random variables,

$$V(\hat Y_*) = V(\hat\beta_0 + \hat\beta_1 x_*) = V(\hat\beta_0) + x_*^2\,V(\hat\beta_1) + 2x_*\,\text{Cov}(\hat\beta_0,\hat\beta_1).$$

Theorem 13.8 gives the formulas for all the terms in this equation. The estimated standard error $\widehat{se}(\hat Y_*)$ is the square root of this variance, with $\hat\sigma^2$ in place of $\sigma^2$. However, the confidence interval for $Y_*$ is not of the usual form $\hat Y_* \pm z_{\alpha/2}\,\widehat{se}$. The reason for this is explained in Exercise 10. The correct form of the confidence interval is given in the following theorem.


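13.11 Theorem (Prediction Interval). Let

$$\hat\xi_n^2 = \hat\sigma^2\left(\frac{\sum_{i=1}^n (X_i - X_*)^2}{n\sum_{i}(X_i - \bar X)^2} + 1\right). \tag{13.16}$$

An approximate $1 - \alpha$ prediction interval for $Y_*$ is

$$\hat Y_* \pm z_{\alpha/2}\,\hat\xi_n. \tag{13.17}$$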


²Recall from equation (10.5) that the Wald statistic for testing $H_0 : \beta = \beta_0$ versus $H_1 : \beta \ne \beta_0$ is $W = (\hat\beta - \beta_0)/\widehat{se}(\hat\beta)$.

13.12 Example (Election Data Revisited). On the log scale, our linear regression gives the following prediction equation:

log(Buchanan) = -2.3298 + 0.7303 log(Bush).

In Palm Beach, Bush had 152,954 votes and Buchanan had 3,467 votes. On the log scale this is 11.93789 and 8.151045. How likely is this outcome, assuming our regression model is appropriate? Our prediction for log Buchanan votes is -2.3298 + .7303 (11.93789) = 6.388441. Now, 8.151045 is bigger than 6.388441 but is it “significantly” bigger? Let us compute a confidence interval. We find that $\hat\xi_n = .093775$ and the approximate 95 percent confidence interval is (6.200, 6.578) which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard errors from $\hat Y_*$. Going back to the vote scale by exponentiating, the confidence interval is (493, 717) compared to the actual number of votes which is 3,467.
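The numbers in Example 13.12 can be retraced step by step. The sketch below takes the fitted coefficients and $\hat\xi_n = .093775$ as given; using a multiplier of 2 for the 95 percent interval is an assumption (matching Example 13.10), so the upper endpoint may differ from the printed value in the third decimal place.

```python
from math import exp, log

b0, b1 = -2.3298, 0.7303     # fitted coefficients on the log scale
xi_n = 0.093775              # prediction standard error reported in the example

x_star = log(152954)         # log Bush votes in Palm Beach
y_obs = log(3467)            # log Buchanan votes in Palm Beach
y_hat = b0 + b1 * x_star     # predicted log Buchanan votes

lower, upper = y_hat - 2 * xi_n, y_hat + 2 * xi_n
n_std_errors = (y_obs - y_hat) / xi_n
votes_interval = (exp(lower), exp(upper))   # back on the vote scale
```

The observed value sits roughly 19 prediction standard errors above the prediction, and the interval on the vote scale tops out well below the 3,467 votes actually recorded.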


13.5  Multiple  Regression 


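Now suppose that the covariate is a vector of length $k$. The data are of the form

$$(Y_1,X_1),\dots,(Y_i,X_i),\dots,(Y_n,X_n)$$

where

$$X_i = (X_{i1},\dots,X_{ik}).$$

Here, $X_i$ is the vector of $k$ covariate values for the $i$th observation. The linear regression model is

$$Y_i = \sum_{j=1}^k \beta_j X_{ij} + \epsilon_i \tag{13.18}$$

for $i = 1,\dots,n$, where $E(\epsilon_i|X_{i1},\dots,X_{ik}) = 0$. Usually we want to include an intercept in the model which we can do by setting $X_{i1} = 1$ for $i = 1,\dots,n$. At this point it will be more convenient to express the model in matrix notation. The outcomes will be denoted by

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$

and the covariates will be denoted by

$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix}.$$

Each row is one observation; the columns correspond to the $k$ covariates. Thus, $X$ is a $(n \times k)$ matrix. Let

$$\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \quad\text{and}\quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

Then we can write (13.18) as

$$Y = X\beta + \epsilon. \tag{13.19}$$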
The  form  of  the  least  squares  estimate  is  given  in  the  following  theorem. 


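The estimated regression function is $\hat r(x) = \sum_{j=1}^k \hat\beta_j x_j$. An unbiased estimate of $\sigma^2$ is

$$\hat\sigma^2 = \left(\frac{1}{n-k}\right)\sum_{i=1}^n \hat\epsilon_i^2$$

where $\hat\epsilon = X\hat\beta - Y$ is the vector of residuals. An approximate $1 - \alpha$ confidence interval for $\beta_j$ is

$$\hat\beta_j \pm z_{\alpha/2}\,\widehat{se}(\hat\beta_j) \tag{13.23}$$

where $\widehat{se}^2(\hat\beta_j)$ is the $j$th diagonal element of the matrix $\hat\sigma^2 (X^T X)^{-1}$.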
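A multiple regression fit and its standard errors can be sketched with NumPy. The code assumes the standard least squares solution $\hat\beta = (X^TX)^{-1}X^TY$ (computed here with `numpy.linalg.lstsq`); the function name `fit_linear_model` and the simulated data are illustrative assumptions, not taken from the text.

```python
import numpy as np

def fit_linear_model(X, y):
    """Least squares fit of y on the columns of X.

    Returns (beta_hat, se, sigma2_hat), where se holds the square roots of the
    diagonal of sigma2_hat * (X^T X)^{-1}.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = X @ beta_hat - y
    sigma2_hat = residuals @ residuals / (n - k)      # unbiased: divisor n - k
    cov = sigma2_hat * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov))
    return beta_hat, se, float(sigma2_hat)

# Simulated data (hypothetical): intercept plus two covariates, noise sd 0.3.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.3 * rng.standard_normal(n)

beta_hat, se, sigma2_hat = fit_linear_model(X, y)
z = 1.96                                              # approximate 95 percent interval
intervals = np.column_stack([beta_hat - z * se, beta_hat + z * se])
```

The first column of `X` is set to 1 so that the first coefficient plays the role of the intercept, as described above.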


13.14 Example. Crime data on 47 states in 1960 can be obtained from
http://lib.stat.cmu.edu/DASL/Stories/USCrime.html.

If we fit a linear regression of crime rate on 10 variables we get the following:

Covariate                 $\hat\beta_j$   $\widehat{se}(\hat\beta_j)$   t value   p-value
(Intercept)                 -589.39          167.59                      -3.51     0.001 **
Age                            1.04            0.45                       2.33     0.025 *
Southern State                11.29           13.24                       0.85     0.399
Education                      1.18            0.68                       1.7      0.093
Expenditures                   0.96            0.25                       3.86     0.000 ***
Labor                          0.11            0.15                       0.69     0.493
Number of Males                0.30            0.22                       1.36     0.181
Population                     0.09            0.14                       0.65     0.518
Unemployment (14-24)          -0.68            0.48                      -1.4      0.165
Unemployment (25-39)           2.15            0.95                       2.26     0.030 *
Wealth                        -0.08            0.09                      -0.91     0.367

This table is typical of the output of a multiple regression program. The “t-value” is the Wald test statistic for testing $H_0 : \beta_j = 0$ versus $H_1 : \beta_j \ne 0$. The asterisks denote “degree of significance” and more asterisks denote smaller p-values. The example raises several important questions: (1) should we eliminate some variables from this model? (2) should we interpret these relationships as causal? For example, should we conclude that low crime prevention expenditures cause high crime rates? We will address question (1) in the next section. We will not address question (2) until Chapter 16. ■


13.6  Model  Selection 

Example 13.14 illustrates a problem that often arises in multiple regression. We may have data on many covariates but we may not want to include all of them in the model. A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler). Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases. Too few covariates yields high bias; this is called underfitting. Too many covariates yields high variance; this is called overfitting. Good predictions result from achieving a good balance between bias and variance.

In model selection there are two problems: (i) assigning a “score” to each model which measures, in some sense, how good the model is, and (ii) searching through all the models to find the model with the best score.

Let us first discuss the problem of scoring models. Let $S \subset \{1,\dots,k\}$ and let $\mathcal{X}_S = \{X_j : j \in S\}$ denote a subset of the covariates. Let $\beta_S$ denote the coefficients of the corresponding set of covariates and let $\hat\beta_S$ denote the least squares estimate of $\beta_S$. Also, let $X_S$ denote the $X$ matrix for this subset of covariates and define $\hat r_S(x)$ to be the estimated regression function. The predicted values from model $S$ are denoted by $\hat Y_i(S) = \hat r_S(X_i)$. The prediction risk is defined to be

$$R(S) = \sum_{i=1}^n E\bigl(\hat Y_i(S) - Y_i^*\bigr)^2 \tag{13.24}$$

where $Y_i^*$ denotes the value of a future observation of $Y_i$ at covariate value $X_i$. Our goal is to choose $S$ to make $R(S)$ small.

The training error is defined to be

$$\hat R_{tr}(S) = \sum_{i=1}^n \bigl(\hat Y_i(S) - Y_i\bigr)^2.$$

This estimate is very biased as an estimate of $R(S)$.

13.15 Theorem. The training error is a downward-biased estimate of the prediction risk:

$$E(\hat R_{tr}(S)) < R(S).$$

In fact,

$$\text{bias}(\hat R_{tr}(S)) = E(\hat R_{tr}(S)) - R(S) = -2\sum_{i=1}^n \text{Cov}(\hat Y_i, Y_i). \tag{13.25}$$


The reason for the bias is that the data are being used twice: to estimate the parameters and to estimate the risk. When we fit a complex model with many parameters, the covariance $\text{Cov}(\hat Y_i, Y_i)$ will be large and the bias of the training error gets worse. Here are some better estimates of risk.

Mallows' $C_p$ statistic is defined by

$$\hat R(S) = \hat R_{tr}(S) + 2|S|\hat\sigma^2 \tag{13.26}$$

where $|S|$ denotes the number of terms in $S$ and $\hat\sigma^2$ is the estimate of $\sigma^2$ obtained from the full model (with all covariates in the model). This is simply the training error plus a bias correction. This estimate is named in honor of Colin Mallows who invented it. The first term in (13.26) measures the fit of the model while the second measures the complexity of the model. Think of the $C_p$ statistic as:

lack of fit + complexity penalty.

Thus, finding a good model involves trading off fit and complexity.

A related method for estimating risk is AIC (Akaike Information Criterion). The idea is to choose $S$ to maximize

$$\ell_S - |S| \tag{13.27}$$

where $\ell_S$ is the log-likelihood of the model evaluated at the MLE.³ This can be thought of as “goodness of fit” minus “complexity.” In linear regression with Normal errors (and taking $\sigma$ equal to its estimate from the largest model), maximizing AIC is equivalent to minimizing Mallows' $C_p$; see Exercise 8. The appendix contains more explanation about AIC.
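Scoring every subset of covariates with Mallows' $C_p$ (training error plus $2|S|\hat\sigma^2$, with $\hat\sigma^2$ taken from the full model) can be sketched as follows. The helper names, the simulated data, and the choice to omit an intercept are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def training_error(X, y, cols):
    """Residual sum of squares from a least squares fit on the columns in cols."""
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((Xs @ beta - y) ** 2))

def mallows_cp(X, y, cols, sigma2_full):
    """Training error plus the complexity penalty 2 |S| sigma2_full."""
    return training_error(X, y, cols) + 2 * len(cols) * sigma2_full

# Simulated data (hypothetical): only the first two of five covariates matter.
rng = np.random.default_rng(1)
n, k = 200, 5
X = rng.standard_normal((n, k))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

# Variance estimate from the full model, divisor n - k.
sigma2_full = training_error(X, y, range(k)) / (n - k)

# Score every non-empty subset and keep the one with the smallest Cp.
subsets = [c for r in range(1, k + 1) for c in combinations(range(k), r)]
scores = {c: mallows_cp(X, y, c, sigma2_full) for c in subsets}
best = min(scores, key=scores.get)
```

The training error alone can only decrease as covariates are added, so it always favors the full model; the penalty term is what allows a smaller subset to win.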

Yet  another  method  for  estimating  risk  is  leave-one-out  cross-validation. 
In  this  case,  the  risk  estimator  is 


n 


RCV(S)  =  J2(Yi-%))' 


(13.28) 


i=  1 


where  Y^  is  the  prediction  for  Y*  obtained  by  fitting  the  model  with  Y*  omit¬ 
ted.  It  can  be  shown  that 

Yi-Yi(S)"  2 


n 


Rcv(S)  =  E 


i= 1 


l-Uu(S) 


(13.29) 


where  Ua(S)  is  the  ith  diagonal  element  of  the  matrix 


U(S)  =  Xs(XjXs)-1^. 


1  vT 


(13.30) 


Thus,  one  need  not  actually  drop  each  observation  and  re- fit  the  model.  A 
generalization  is  k-fold  cross-validation.  Here  we  divide  the  data  into  k 
groups;  often  people  take  k  =  10.  We  omit  one  group  of  data  and  fit  the 
models  to  the  remaining  data.  We  use  the  fitted  model  to  predict  the  data 
in  the  group  that  was  omitted.  We  then  estimate  the  risk  by  —  Yi)2 

where  the  sum  is  over  the  the  data  points  in  the  omitted  group.  This  process  is 
repeated  for  each  of  the  k  groups  and  the  resulting  risk  estimates  are  averaged. 
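
The identity (13.29) can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using simulated data; the function names and the simulated design are illustrative assumptions, not part of the text. It computes the leave-one-out risk once by refitting n times and once from a single fit using the diagonal of U(S).

```python
import numpy as np

def loo_cv_bruteforce(X, y):
    """Leave-one-out risk estimate (13.28): refit the model n times."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # drop observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2            # squared prediction error for Y_i
    return total

def loo_cv_shortcut(X, y):
    """Same quantity from one fit, using the diagonal of U as in (13.29)."""
    U = X @ np.linalg.solve(X.T @ X, X.T)             # U = X (X^T X)^{-1} X^T
    resid = y - U @ y                                 # ordinary residuals Y_i - Yhat_i
    return float(np.sum((resid / (1.0 - np.diag(U))) ** 2))

# Simulated data (hypothetical): intercept plus two covariates.
rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

brute = loo_cv_bruteforce(X, y)
short = loo_cv_shortcut(X, y)
print(brute, short)   # the two values should agree up to rounding error
```

The shortcut needs one matrix solve instead of n separate fits, which is the practical point of (13.29).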

For  linear  regression,  Mallows  Cp  and  cross-validation  often  yield  essentially 
the  same  results  so  one  might  as  well  use  Mallows’  method.  In  some  of  the 
more  complex  problems  we  will  discuss  later,  cross-validation  will  be  more 
useful. 

Another  scoring  method  is  BIC  (Bayesian  information  criterion).  Here  we 
choose  a  model  to  maximize 


$$\mathrm{BIC}(S) = \ell_S - \frac{|S|}{2} \log n \qquad (13.31)$$


3 Some texts use a slightly different definition of AIC which involves multiplying the definition
here by 2 or -2. This has no effect on which model is selected.

The BIC score has a Bayesian interpretation. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ denote
a set of models. Suppose we assign the prior $P(S_j) = 1/m$ over the models.
Also, assume we put a smooth prior on the parameters within each model. It
can be shown that the posterior probability for a model is approximately

$$P(S_j \mid \text{data}) \approx \frac{e^{\mathrm{BIC}(S_j)}}{\sum_r e^{\mathrm{BIC}(S_r)}}.$$


Hence,  choosing  the  model  with  highest  BIC  is  like  choosing  the  model  with 
highest  posterior  probability.  The  BIC  score  also  has  an  information-theoretic 
interpretation  in  terms  of  something  called  minimum  description  length.  The 
BIC  score  is  identical  to  Mallows  Cp  except  that  it  puts  a  more  severe  penalty 
for  complexity.  It  thus  leads  one  to  choose  a  smaller  model  than  the  other 
methods. 
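
As a small illustration of the Bayesian interpretation, the sketch below turns a list of BIC scores into approximate posterior model probabilities. The scores are made-up numbers and the function name is an assumption; only the standard library is used.

```python
import math

def bic_posterior(bic_scores):
    """Approximate P(S_j | data) as exp(BIC_j) / sum_r exp(BIC_r)."""
    m = max(bic_scores)                        # subtract the max to avoid overflow
    weights = [math.exp(b - m) for b in bic_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BIC scores (log-likelihood minus (|S|/2) log n) for three models.
scores = [-120.4, -118.9, -123.0]
probs = bic_posterior(scores)
print(probs)   # the model with the highest BIC gets the largest probability
```

Subtracting the maximum score before exponentiating does not change the ratios; it only keeps the exponentials in a safe numerical range.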

Now  let  us  turn  to  the  problem  of  model  search.  If  there  are  k  covariates 
then there are $2^k$ possible models. We need to search through all these models,
assign  a  score  to  each  one,  and  choose  the  model  with  the  best  score.  If  k  is 
not  too  large  we  can  do  a  complete  search  over  all  the  models.  When  k  is  large, 
this  is  infeasible.  In  that  case  we  need  to  search  over  a  subset  of  all  the  models. 
Two  common  methods  are  forward  and  backward  stepwise  regression. 
In  forward  stepwise  regression,  we  start  with  no  covariates  in  the  model.  We 
then  add  the  one  variable  that  leads  to  the  best  score.  We  continue  adding 
variables  one  at  a  time  until  the  score  does  not  improve.  Backwards  stepwise 
regression  is  the  same  except  that  we  start  with  the  biggest  model  and  drop 
one variable at a time. Both are greedy searches; neither is guaranteed to
find  the  model  with  the  best  score.  Another  popular  method  is  to  do  random 
searching  through  the  set  of  all  models.  However,  there  is  no  reason  to  expect 
this  to  be  superior  to  a  deterministic  search. 
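
Forward stepwise regression can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes NumPy, uses simulated data, and scores a model by RSS(S) + 2|S| times an estimate of the error variance from the full model (a Mallows-type "lack of fit + complexity penalty" score to be minimized). The function names are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for the model with an intercept and the given columns."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    n, k = X.shape
    sigma2 = rss(X, y, list(range(k))) / (n - k - 1)      # variance estimate from the full model
    score = lambda cols: rss(X, y, cols) + 2 * len(cols) * sigma2
    selected, best = [], score([])                        # start with no covariates
    while len(selected) < k:
        candidates = [j for j in range(k) if j not in selected]
        trial = {j: score(selected + [j]) for j in candidates}
        j_best = min(trial, key=trial.get)                # variable giving the best score
        if trial[j_best] >= best:                         # score does not improve: stop
            break
        selected.append(j_best)
        best = trial[j_best]
    return selected, best

# Simulated data (hypothetical): only columns 0 and 3 affect the response.
rng = np.random.default_rng(0)
n, k = 200, 6
X = rng.normal(size=(n, k))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

selected, best = forward_stepwise(X, y)
print(selected, best)
```

Because the search is greedy, it evaluates at most k(k+1)/2 models rather than all $2^k$, at the cost of possibly missing the best-scoring model.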


13.16  Example.  We  applied  backwards  stepwise  regression  to  the  crime  data 
using  AIC.  The  following  was  obtained  from  the  program  R.  This  program 
uses  a  slightly  different  definition  of  AIC.  With  their  definition,  we  seek  the 
smallest (not largest) possible AIC. This is the same as minimizing Mallows Cp.

The full model (which includes all covariates) has AIC = 310.37. In ascending
order, the AIC scores for deleting one variable are as follows:

variable   Pop   Labor   South   Wealth   Males   U1    Educ.   U2    Age   Expend
AIC        308   309     309     309      310     310   312     314   315   324

For  example,  if  we  dropped  Pop  from  the  model  and  kept  the  other  terms, 
then the AIC score would be 308. Based on this information we drop “population”
from the model and the current AIC score is 308. Now we consider
dropping  a  variable  from  the  current  model.  The  AIC  scores  are: 


variable   South   Labor   Wealth   Males   U1    Education   U2    Age   Expend
AIC        308     308     308      309     309   310         313   313   329

We  then  drop  “Southern”  from  the  model.  This  process  is  continued  until 
there  is  no  gain  in  AIC  by  dropping  any  variables.  In  the  end,  we  are  left  with 
the  following  model: 

Crime  =  1.2  Age  +  .75  Education  +  .87  Expenditure 

+  .34  Males  -  .86  U1  +  2.31  U2. 

Warning!  This  does  not  yet  address  the  question  of  which  variables  are 
causes  of  crime.  ■ 

There  is  another  method  for  model  selection  that  avoids  having  to  search 
through  all  possible  models.  This  method,  which  is  due  to  Zheng  and  Loh 
(1995),  does  not  seek  to  minimize  prediction  errors.  Rather,  it  assumes  some 
subset of the $\beta_j$'s are exactly equal to 0 and tries to find the true model,
that is, the smallest sub-model consisting of nonzero $\beta_j$ terms. The method
is  carried  out  as  follows. 


Zheng-Loh Model Selection Method (footnote 4)

1. Fit the full model with all k covariates and let $W_j = \hat\beta_j / \mathrm{se}(\hat\beta_j)$ denote
the Wald test statistic for $H_0 : \beta_j = 0$ versus $H_1 : \beta_j \neq 0$.

2. Order the test statistics from largest to smallest in absolute value:

$$|W_{(1)}| \geq |W_{(2)}| \geq \cdots \geq |W_{(k)}|.$$

3. Let $\hat{j}$ be the value of j that minimizes

$$\mathrm{RSS}(j) + j\,\hat\sigma^2 \log n$$

where RSS(j) is the residual sums of squares from the model with
the j largest Wald statistics.

4. Choose, as the final model, the regression with the $\hat{j}$ terms with the
largest absolute Wald statistics.
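
The four steps above can be sketched in code. This is a hedged illustration assuming NumPy and simulated data; the function name, the choice to allow j = 0, and the use of the full-model residual variance for $\hat\sigma^2$ are assumptions made for the example.

```python
import numpy as np

def zheng_loh(X, y):
    """X: n x k covariate matrix. Returns sorted indices of the selected covariates."""
    n, k = X.shape
    # Step 1: fit the full model and form the Wald statistics W_j.
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / (n - k)           # assumed estimate of sigma^2
    W = beta / np.sqrt(sigma2 * np.diag(XtX_inv))
    # Step 2: order covariates by decreasing |W_j|.
    order = np.argsort(-np.abs(W))

    def rss(cols):
        if len(cols) == 0:                            # model with no covariates
            return float(y @ y)
        b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ b
        return float(r @ r)

    # Step 3: minimize RSS(j) + j * sigma2 * log n over j = 0, ..., k.
    crit = [rss(order[:j]) + j * sigma2 * np.log(n) for j in range(k + 1)]
    j_hat = int(np.argmin(crit))
    # Step 4: keep the j_hat covariates with the largest |W_j|.
    return sorted(order[:j_hat].tolist())

# Simulated data (hypothetical): only covariates 0 and 2 have nonzero coefficients.
rng = np.random.default_rng(2)
n, k = 300, 5
X = rng.normal(size=(n, k))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0]) + rng.normal(size=n)

chosen = zheng_loh(X, y)
print(chosen)
```

Only k + 1 nested models are compared, so no search over subsets is needed.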


Zheng and Loh showed that, under appropriate conditions, this method
chooses the true model with probability tending to one as the sample size
increases.


13.7 Logistic Regression

FIGURE 13.3. The logistic function $p = e^x/(1 + e^x)$.

So far we have assumed that $Y_i$ is real valued. Logistic regression is a parametric
method for regression when $Y_i \in \{0, 1\}$ is binary. For a k-dimensional
covariate X, the model is

$$p_i \equiv p_i(\beta) \equiv P(Y_i = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}}}{1 + e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}}} \qquad (13.32)$$

or, equivalently,

$$\mathrm{logit}(p_i) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \qquad (13.33)$$

where

$$\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right). \qquad (13.34)$$


The name “logistic regression” comes from the fact that $e^x/(1 + e^x)$ is called
the  logistic  function.  A  plot  of  the  logistic  for  a  one-dimensional  covariate  is 
shown  in  Figure  13.3. 

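Because the $Y_i$'s are binary, the data are Bernoulli:

$$Y_i \mid X_i = x_i \sim \mathrm{Bernoulli}(p_i).$$

Hence the (conditional) likelihood function is

$$\mathcal{L}(\beta) = \prod_{i=1}^{n} p_i(\beta)^{Y_i} \bigl(1 - p_i(\beta)\bigr)^{1 - Y_i}. \qquad (13.35)$$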


4 This is just one version of their method. In particular, the penalty j log n is only one choice
from a set of possible penalty functions.

The mle $\hat\beta$ has to be obtained by maximizing $\mathcal{L}(\beta)$ numerically. There is
a fast numerical algorithm called reweighted least squares. The steps are as
follows:


Reweighted Least Squares Algorithm

Choose starting values $\hat\beta^0 = (\hat\beta_0^0, \ldots, \hat\beta_k^0)$ and compute $p_i^0$ using equation
(13.32), for i = 1, ..., n. Set s = 0 and iterate the following steps until
convergence.

1. Set

$$Z_i = \mathrm{logit}(p_i^s) + \frac{Y_i - p_i^s}{p_i^s (1 - p_i^s)}, \quad i = 1, \ldots, n.$$

2. Let W be a diagonal matrix with (i, i) element equal to $p_i^s (1 - p_i^s)$.

3. Set

$$\hat\beta^{s+1} = (X^T W X)^{-1} X^T W Z.$$

This corresponds to doing a (weighted) linear regression of Z on X.

4. Compute $p_i^{s+1}$ from $\hat\beta^{s+1}$ using equation (13.32), set s = s + 1, and go back to the first step.


The Fisher information matrix I can also be obtained numerically. The
estimated standard error of $\hat\beta_j$ is the square root of the (j, j) element of $J = I^{-1}$. Model selection
is usually done using the AIC score $\ell_S - |S|$.
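
A minimal sketch of the reweighted least squares iteration follows, assuming NumPy and simulated data. The function name, the zero starting values, and the stopping rule are illustrative assumptions. Since logit(p_i) equals the current linear predictor, the working response is computed from it directly.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-10):
    """X: n x (k+1) design matrix whose first column is all ones; y: 0/1 responses."""
    beta = np.zeros(X.shape[1])                    # starting values (assumed: all zeros)
    for _ in range(n_iter):
        eta = X @ beta                             # linear predictor = logit(p)
        p = 1.0 / (1.0 + np.exp(-eta))             # equation (13.32)
        w = p * (1.0 - p)                          # diagonal of W
        z = eta + (y - p) / w                      # working response Z
        XtW = X.T * w                              # X^T W without forming an n x n matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = (X.T * (p * (1.0 - p))) @ X             # Fisher information X^T W X at the mle
    se = np.sqrt(np.diag(np.linalg.inv(info)))     # estimated standard errors
    return beta, se

# Simulated data (hypothetical): one covariate, true coefficients (-0.5, 1.0).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

beta_hat, se = logistic_irls(X, y)
print(beta_hat, se)
```

At convergence the score equations $X^T(Y - p) = 0$ hold, which gives a simple check that the iteration has reached the mle.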


13.17  Example.  The  Coronary  Risk-Factor  Study  (CORIS)  data  involve  462 
males  between  the  ages  of  15  and  64  from  three  rural  areas  in  South  Africa, 
(Rousseauw  et  al.  (1983)).  The  outcome  Y  is  the  presence  (Y  =  1)  or  absence 
(Y  =  0)  of  coronary  heart  disease.  There  are  9  covariates:  systolic  blood 
pressure,  cumulative  tobacco  (kg),  ldl  (low  density  lipoprotein  cholesterol), 
adiposity,  famhist  (family  history  of  heart  disease),  typea  (type- A  behavior), 
obesity,  alcohol  (current  alcohol  consumption),  and  age.  A  logistic  regression 
yields the following estimates and Wald statistics $W_j$ for the coefficients:

Covariate    $\hat\beta_j$       se       $W_j$      p-value
Intercept    -6.145    1.300    -4.738    0.000
sbp           0.007    0.006     1.138    0.255
tobacco       0.079    0.027     2.991    0.003
ldl           0.174    0.059     2.925    0.003
adiposity     0.019    0.029     0.637    0.524
famhist       0.925    0.227     4.078    0.000
typea         0.040    0.012     3.233    0.001
obesity      -0.063    0.044    -1.427    0.153
alcohol       0.000    0.004     0.027    0.979
age           0.045    0.012     3.754    0.000

Are  you  surprised  by  the  fact  that  systolic  blood  pressure  is  not  significant 
or  by  the  minus  sign  for  the  obesity  coefficient?  If  yes,  then  you  are  confusing 
association  and  causation.  This  issue  is  discussed  in  Chapter  16.  The  fact 
that  blood  pressure  is  not  significant  does  not  mean  that  blood  pressure  is 
not  an  important  cause  of  heart  disease.  It  means  that  it  is  not  an  important 
predictor  of  heart  disease  relative  to  the  other  variables  in  the  model.  ■ 


13.8  Bibliographic  Remarks 

A  succinct  book  on  linear  regression  is  Weisberg  (1985).  A  data-mining  view 
of  regression  is  given  in  Hastie  et  al.  (2001).  The  Akaike  Information  Criterion 
(AIC)  is  due  to  Akaike  (1973).  The  Bayesian  Information  Criterion  (BIC)  is 
due  to  Schwarz  (1978).  References  on  logistic  regression  include  Agresti  (1990) 
and  Dobson  (2001). 


13.9  Appendix 

The Akaike Information Criterion (AIC). Consider a set of models
$\{M_1, M_2, \ldots\}$. Let $\hat{f}_j(x)$ denote the estimated probability function obtained
by using the maximum likelihood estimator of model $M_j$. Thus, $\hat{f}_j(x) = f(x; \hat\beta_j)$
where $\hat\beta_j$ is the mle of the set of parameters $\beta_j$ for model $M_j$. We
will use the loss function $D(f, \hat{f})$ where

$$D(f, g) = \sum_x f(x) \log\left(\frac{f(x)}{g(x)}\right)$$

is the Kullback-Leibler distance between two probability functions. The corresponding
risk function is $R(f, \hat{f}) = E(D(f, \hat{f}))$. Notice that $D(f, \hat{f}) = c - A(f, \hat{f})$
where $c = \sum_x f(x) \log f(x)$ does not depend on $\hat{f}$ and

$$A(f, \hat{f}) = \sum_x f(x) \log \hat{f}(x).$$

Thus, minimizing the risk is equivalent to maximizing $a(f, \hat{f}) \equiv E(A(f, \hat{f}))$.

It is tempting to estimate $a(f, \hat{f})$ by $\sum_x \hat{f}(x) \log \hat{f}(x)$ but, just as the training
error in regression is a highly biased estimate of prediction risk, it is also
the case that $\sum_x \hat{f}(x) \log \hat{f}(x)$ is a highly biased estimate of $a(f, \hat{f})$. In fact,
the bias is approximately equal to $|M_j|$. Thus:

13.18 Theorem. $\mathrm{AIC}(M_j)$ is an approximately unbiased estimate of $a(f, \hat{f})$.


13.10  Exercises 

1.  Prove  Theorem  13.4. 

2.  Prove  the  formulas  for  the  standard  errors  in  Theorem  13.8.  You  should 
regard the $X_i$'s as fixed constants.

3. Consider the regression through the origin model:

$$Y_i = \beta X_i + \epsilon.$$

Find the least squares estimate for $\beta$. Find the standard error of the
estimate. Find conditions that guarantee that the estimate is consistent.

4. Prove equation (13.25).

5. In the simple linear regression model, construct a Wald test for
$H_0 : \beta_1 = 17\beta_0$ versus $H_1 : \beta_1 \neq 17\beta_0$.

6.  Get  the  passenger  car  mileage  data  from 

http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html

(a)  Fit  a  simple  linear  regression  model  to  predict  MPG  (miles  per 
gallon)  from  HP  (horsepower).  Summarize  your  analysis  including  a 
plot  of  the  data  with  the  fitted  line. 

(b)  Repeat  the  analysis  but  use  log(MPG)  as  the  response.  Compare 
the analyses.

7. Get the passenger car mileage data from
http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html

(a)  Fit  a  multiple  linear  regression  model  to  predict  MPG  (miles  per 
gallon)  from  the  other  variables.  Summarize  your  analysis. 

(b) Use Mallows Cp to select a best sub-model. To search through the
models  try  (i)  forward  stepwise,  (ii)  backward  stepwise.  Summarize  your 
findings. 

(c)  Use  the  Zheng-Loh  model  selection  method  and  compare  to  (b). 

(d)  Perform  all  possible  regressions.  Compare  Cp  and  BIC.  Compare  the 
results. 

8. Assume a linear regression model with Normal errors. Take $\sigma$ known.
Show  that  the  model  with  highest  AIC  (equation  (13.27))  is  the  model 
with  the  lowest  Mallows  Cp  statistic. 

9. In this question we will take a closer look at the AIC method. Let
$X_1, \ldots, X_n$ be IID observations. Consider two models $M_0$ and $M_1$. Under
$M_0$ the data are assumed to be N(0, 1) while under $M_1$ the data
are assumed to be $N(\theta, 1)$ for some unknown $\theta \in \mathbb{R}$:

$$M_0 : X_1, \ldots, X_n \sim N(0, 1)$$
$$M_1 : X_1, \ldots, X_n \sim N(\theta, 1), \quad \theta \in \mathbb{R}.$$

This is just another way to view the hypothesis testing problem: $H_0 :
\theta = 0$ versus $H_1 : \theta \neq 0$. Let $\ell_n(\theta)$ be the log-likelihood function.
The AIC score for a model is the log-likelihood at the mle minus the
number of parameters. (Some people multiply this score by 2 but that
is irrelevant.) Thus, the AIC score for $M_0$ is $AIC_0 = \ell_n(0)$ and the AIC
score for $M_1$ is $AIC_1 = \ell_n(\hat\theta) - 1$. Suppose we choose the model with
the highest AIC score. Let $J_n$ denote the selected model:

$$J_n = \begin{cases} 0 & \text{if } AIC_0 > AIC_1 \\ 1 & \text{if } AIC_1 > AIC_0. \end{cases}$$

(a) Suppose that $M_0$ is the true model, i.e. $\theta = 0$. Find

$$\lim_{n \to \infty} P(J_n = 0).$$

Now compute $\lim_{n \to \infty} P(J_n = 0)$ when $\theta \neq 0$.

(b) The fact that $\lim_{n \to \infty} P(J_n = 0) \neq 1$ when $\theta = 0$ is why some people
say that AIC “overfits.” But this is not quite true as we shall now see.
Let $\phi_\theta(x)$ denote a Normal density function with mean $\theta$ and variance
1. Define

$$\hat{f}_n(x) = \begin{cases} \phi_0(x) & \text{if } J_n = 0 \\ \phi_{\hat\theta}(x) & \text{if } J_n = 1. \end{cases}$$

If $\theta = 0$, show that $D(\phi_0, \hat{f}_n) \to 0$ in probability as $n \to \infty$ where

$$D(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x)}\right) dx$$

is the Kullback-Leibler distance. Show also that $D(\phi_\theta, \hat{f}_n) \to 0$ in probability if $\theta \neq 0$.
Hence, AIC consistently estimates the true density even if it “overshoots”
the correct model.


(c) Repeat this analysis for BIC which is the log-likelihood minus (p/2) log n
where p is the number of parameters and n is sample size.


10. In this question we take a closer look at prediction intervals. Let
$\theta = \beta_0 + \beta_1 X_*$ and let $\hat\theta = \hat\beta_0 + \hat\beta_1 X_*$. Thus, $\hat{Y}_* = \hat\theta$ while $Y_* = \theta + \epsilon$. Now,
$\hat\theta \approx N(\theta, \mathrm{se}^2)$ where

$$\mathrm{se}^2 = V(\hat\theta) = V(\hat\beta_0 + \hat\beta_1 x_*).$$

Note that $V(\hat\theta)$ is the same as $V(\hat{Y}_*)$. Now, $\hat\theta \pm 2\sqrt{V(\hat\theta)}$ is an approximate
95 percent confidence interval for $\theta = \beta_0 + \beta_1 x_*$ using the usual argument
for a confidence interval. But, as you shall now show, it is not a valid
confidence interval for $Y_*$.

(a) Let $s = \sqrt{V(\hat{Y}_*)}$. Show that

$$P(\hat{Y}_* - 2s < Y_* < \hat{Y}_* + 2s) \approx P\left(-2 < N\left(0, 1 + \frac{\sigma^2}{s^2}\right) < 2\right) \neq 0.95.$$

(b) The problem is that the quantity of interest $Y_*$ is equal to a parameter
$\theta$ plus a random variable. We can fix this by defining

$$\xi_n^2 = V(\hat{Y}_*) + \sigma^2 = \left[ \frac{\sum_i (x_i - x_*)^2}{n \sum_i (x_i - \bar{x})^2} + 1 \right] \sigma^2.$$

In practice, we substitute $\hat\sigma$ for $\sigma$ and we denote the resulting quantity
by $\hat\xi_n$. Now consider the interval $\hat{Y}_* \pm 2\hat\xi_n$. Show that

$$P(\hat{Y}_* - 2\hat\xi_n < Y_* < \hat{Y}_* + 2\hat\xi_n) \approx P(-2 < N(0, 1) < 2) \approx 0.95.$$

11. Get the Coronary Risk-Factor Study (CORIS) data from the book web
site. Use backward stepwise logistic regression based on AIC to select a
model. Summarize your results.


Multivariate Models


In  this  chapter  we  revisit  the  Multinomial  model  and  the  multivariate  Normal. 
Let  us  first  review  some  notation  from  linear  algebra.  In  what  follows,  x  and 
y  are  vectors  and  A  is  a  matrix. 


Linear  Algebra  Notation 


$x^T y$       inner product $\sum_j x_j y_j$
$|A|$         determinant
$A^T$         transpose of A
$A^{-1}$      inverse of A
$I$           the identity matrix
tr(A)         trace of a square matrix; sum of its diagonal elements
$A^{1/2}$     square root matrix


The trace satisfies tr(AB) = tr(BA) and tr(A + B) = tr(A) + tr(B). Also, tr(a) = a if a
is a scalar. A matrix $\Sigma$ is positive definite if $x^T \Sigma x > 0$ for all nonzero vectors
x. If a matrix A is symmetric and positive definite, its square root $A^{1/2}$ exists
and has the following properties: (1) $A^{1/2}$ is symmetric; (2) $A = A^{1/2} A^{1/2}$;
(3) $A^{1/2} A^{-1/2} = A^{-1/2} A^{1/2} = I$ where $A^{-1/2} = (A^{1/2})^{-1}$.

14.1 Random Vectors


Multivariate models involve a random vector X of the form

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix}.$$

The mean of a random vector X is defined by

$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_k) \end{pmatrix}. \qquad (14.1)$$

The covariance matrix $\Sigma$, also written V(X), is defined to be

$$\Sigma = \begin{pmatrix} V(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_k) \\ \mathrm{Cov}(X_2, X_1) & V(X_2) & \cdots & \mathrm{Cov}(X_2, X_k) \\ \vdots & \vdots & & \vdots \\ \mathrm{Cov}(X_k, X_1) & \mathrm{Cov}(X_k, X_2) & \cdots & V(X_k) \end{pmatrix}. \qquad (14.2)$$

This is also called the variance matrix or the variance-covariance matrix. The
inverse $\Sigma^{-1}$ is called the precision matrix.

14.1 Theorem. Let a be a vector of length k and let X be a random vector
of the same length with mean $\mu$ and variance $\Sigma$. Then $E(a^T X) = a^T \mu$ and
$V(a^T X) = a^T \Sigma a$. If A is a matrix with k columns, then $E(AX) = A\mu$ and
$V(AX) = A \Sigma A^T$.

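Now suppose we have a random sample of n vectors:

$$\begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{k1} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \\ \vdots \\ X_{k2} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \\ \vdots \\ X_{kn} \end{pmatrix}. \qquad (14.3)$$

The sample mean $\bar{X}$ is a vector defined by

$$\bar{X} = \begin{pmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_k \end{pmatrix}$$

where $\bar{X}_i = n^{-1} \sum_{j=1}^{n} X_{ij}$. The sample variance matrix, also called the covariance
matrix or the variance-covariance matrix, is

$$S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1k} \\ s_{12} & s_{22} & \cdots & s_{2k} \\ \vdots & \vdots & & \vdots \\ s_{1k} & s_{2k} & \cdots & s_{kk} \end{pmatrix}$$

where

$$s_{ab} = \frac{1}{n - 1} \sum_{j=1}^{n} (X_{aj} - \bar{X}_a)(X_{bj} - \bar{X}_b). \qquad (14.4)$$

It follows that $E(\bar{X}) = \mu$ and $E(S) = \Sigma$.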


14.2  Estimating  the  Correlation 


Consider n data points from a bivariate distribution:

$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}.$$

Recall that the correlation between $X_1$ and $X_2$ is

$$\rho = \frac{E((X_1 - \mu_1)(X_2 - \mu_2))}{\sigma_1 \sigma_2} \qquad (14.5)$$

where $\sigma_j^2 = V(X_{ji})$, j = 1, 2. The nonparametric plug-in estimator is the
sample correlation (footnote 1)

$$\hat\rho = \frac{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{(n - 1)\, s_1 s_2} \qquad (14.6)$$

where

$$s_j^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_{ji} - \bar{X}_j)^2.$$

1 More precisely, the plug-in estimator has n rather than n - 1 in the formula for $s_j$ but this
difference is small.

We can construct a confidence interval for $\rho$ by applying the delta method.
However, it turns out that we get a more accurate confidence interval by first
constructing a confidence interval for a function $\theta = f(\rho)$ and then applying
the inverse function $f^{-1}$. The method, due to Fisher, is as follows: Define f
and its inverse by

$$f(r) = \frac{1}{2} \bigl( \log(1 + r) - \log(1 - r) \bigr)$$

$$f^{-1}(z) = \frac{e^{2z} - 1}{e^{2z} + 1}.$$

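Approximate Confidence Interval for The Correlation


1. Compute

$$\hat\theta = f(\hat\rho) = \frac{1}{2} \bigl( \log(1 + \hat\rho) - \log(1 - \hat\rho) \bigr).$$

2. Compute the approximate standard error of $\hat\theta$ which can be shown to
be

$$\mathrm{se}(\hat\theta) = \frac{1}{\sqrt{n - 3}}.$$

3. An approximate $1 - \alpha$ confidence interval for $\theta = f(\rho)$ is

$$(a, b) \equiv \left( \hat\theta - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\ \hat\theta + \frac{z_{\alpha/2}}{\sqrt{n - 3}} \right).$$

4. Apply the inverse transformation $f^{-1}(z)$ to get a confidence interval
for $\rho$:

$$\left( \frac{e^{2a} - 1}{e^{2a} + 1},\ \frac{e^{2b} - 1}{e^{2b} + 1} \right).$$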


Yet  another  method  for  getting  a  confidence  interval  for  p  is  to  use  the 
bootstrap. 
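
Fisher's interval is short enough to sketch directly. The example below uses only the Python standard library; the function name and the input values (a sample correlation of 0.6 from n = 50 pairs) are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(rho_hat, n, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the correlation via Fisher's method."""
    theta = 0.5 * (math.log(1 + rho_hat) - math.log(1 - rho_hat))   # step 1: f(rho_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)                          # z_{alpha/2}
    half = z / math.sqrt(n - 3)                                      # step 2: z * se(theta_hat)
    a, b = theta - half, theta + half                                # step 3: interval for theta
    inverse = lambda t: (math.exp(2 * t) - 1) / (math.exp(2 * t) + 1)
    return inverse(a), inverse(b)                                    # step 4: map back to rho

# Hypothetical input: sample correlation 0.6 computed from 50 pairs.
lo, hi = fisher_ci(0.6, 50)
print(lo, hi)
```

The resulting interval is not symmetric about the sample correlation and always stays inside (-1, 1), which is one reason the transformation behaves better than a direct delta-method interval.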

14.3 Multivariate Normal

Recall that a vector X has a multivariate Normal distribution, denoted by
$X \sim N(\mu, \Sigma)$, if its density is

$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

where $\mu$ is a vector of length k and $\Sigma$ is a k x k symmetric, positive definite
matrix. Then $E(X) = \mu$ and $V(X) = \Sigma$.

14.2 Theorem. The following properties hold:

1. If $Z \sim N(0, I)$ and $X = \mu + \Sigma^{1/2} Z$, then $X \sim N(\mu, \Sigma)$.

2. If $X \sim N(\mu, \Sigma)$, then $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$.

3. If $X \sim N(\mu, \Sigma)$ and a is a vector of the same length as X, then
$a^T X \sim N(a^T \mu, a^T \Sigma a)$.

4. Let

$$V = (X - \mu)^T \Sigma^{-1} (X - \mu).$$

Then $V \sim \chi^2_k$.

14.3 Theorem. Given a random sample of size n from a $N(\mu, \Sigma)$, the
log-likelihood is (up to a constant not depending on $\mu$ or $\Sigma$) given by

$$\ell(\mu, \Sigma) = -\frac{n}{2} (\bar{X} - \mu)^T \Sigma^{-1} (\bar{X} - \mu) - \frac{n - 1}{2} \mathrm{tr}(\Sigma^{-1} S) - \frac{n}{2} \log |\Sigma|.$$

The mle is

$$\hat\mu = \bar{X} \quad \text{and} \quad \hat\Sigma = \left( \frac{n - 1}{n} \right) S. \qquad (14.8)$$
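
Property 1 of Theorem 14.2 gives a recipe for simulating from a multivariate Normal, and (14.8) gives the mle. The sketch below assumes NumPy; the function names and the particular mean vector and covariance matrix are hypothetical choices for illustration.

```python
import numpy as np

def sqrt_spd(Sigma):
    """Symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    return (vecs * np.sqrt(vals)) @ vecs.T

def rmvnorm(n, mu, Sigma, rng):
    """Draw n vectors X = mu + Sigma^{1/2} Z with Z ~ N(0, I); one draw per row."""
    Z = rng.standard_normal((n, len(mu)))
    return mu + Z @ sqrt_spd(Sigma)          # the square root is symmetric, so no transpose is needed

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                   # hypothetical mean
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])   # hypothetical covariance matrix
X = rmvnorm(5000, mu, Sigma, rng)

mu_hat = X.mean(axis=0)                      # mle of mu: the sample mean
S = np.cov(X, rowvar=False)                  # sample variance matrix (divides by n - 1)
Sigma_hat = (len(X) - 1) / len(X) * S        # mle of Sigma, as in (14.8)
print(mu_hat)
print(Sigma_hat)
```

A Cholesky factor could be used in place of the symmetric square root; any matrix B with $B B^T = \Sigma$ yields the same distribution.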


14.4  Multinomial 


Let us now review the Multinomial distribution. The data take the form
$X = (X_1, \ldots, X_k)$ where each $X_j$ is a count. Think of drawing n balls (with
replacement) from an urn which has balls with k different colors. In this case,
$X_j$ is the number of balls of the jth color. Let $p = (p_1, \ldots, p_k)$ where $p_j \geq 0$
and $\sum_{j=1}^{k} p_j = 1$ and suppose that $p_j$ is the probability of drawing a ball of
color j.


14.4 Theorem. Let $X \sim \mathrm{Multinomial}(n, p)$. Then the marginal distribution
of $X_j$ is $X_j \sim \mathrm{Binomial}(n, p_j)$. The mean and variance of X are

$$E(X) = \begin{pmatrix} n p_1 \\ \vdots \\ n p_k \end{pmatrix}$$

and

$$V(X) = \begin{pmatrix} n p_1 (1 - p_1) & -n p_1 p_2 & \cdots & -n p_1 p_k \\ -n p_1 p_2 & n p_2 (1 - p_2) & \cdots & -n p_2 p_k \\ \vdots & \vdots & & \vdots \\ -n p_1 p_k & -n p_2 p_k & \cdots & n p_k (1 - p_k) \end{pmatrix}.$$

Proof. That $X_j \sim \mathrm{Binomial}(n, p_j)$ follows easily. Hence, $E(X_j) = n p_j$ and
$V(X_j) = n p_j (1 - p_j)$. To compute $\mathrm{Cov}(X_i, X_j)$ we proceed as follows: Notice
that $X_i + X_j \sim \mathrm{Binomial}(n, p_i + p_j)$ and so $V(X_i + X_j) = n (p_i + p_j)(1 - p_i - p_j)$.
On the other hand,

$$V(X_i + X_j) = V(X_i) + V(X_j) + 2\,\mathrm{Cov}(X_i, X_j) = n p_i (1 - p_i) + n p_j (1 - p_j) + 2\,\mathrm{Cov}(X_i, X_j).$$

Equating this last expression with $n (p_i + p_j)(1 - p_i - p_j)$ implies that
$\mathrm{Cov}(X_i, X_j) = -n p_i p_j$. ■


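14.5 Theorem. The maximum likelihood estimator of p is

$$\hat{p} = \begin{pmatrix} \hat{p}_1 \\ \vdots \\ \hat{p}_k \end{pmatrix} = \begin{pmatrix} X_1 / n \\ \vdots \\ X_k / n \end{pmatrix} = \frac{X}{n}.$$

Proof. The log-likelihood (ignoring a constant) is

$$\ell(p) = \sum_{j=1}^{k} X_j \log p_j.$$

When we maximize $\ell$ we have to be careful since we must enforce the constraint
that $\sum_j p_j = 1$. We use the method of Lagrange multipliers and instead
maximize

$$A(p) = \sum_{j=1}^{k} X_j \log p_j + \lambda \Bigl( \sum_j p_j - 1 \Bigr).$$

Now

$$\frac{\partial A(p)}{\partial p_j} = \frac{X_j}{p_j} + \lambda.$$

Setting $\partial A(p) / \partial p_j = 0$ yields $\hat{p}_j = -X_j / \lambda$. Since $\sum_j \hat{p}_j = 1$ we see that $\lambda = -n$
and hence $\hat{p}_j = X_j / n$ as claimed. ■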

Next we would like to know the variability of the mle. We can either
compute the variance matrix of $\hat{p}$ directly or we can approximate the variability
of the mle by computing the Fisher information matrix. These two
approaches give the same answer in this case. The direct approach is easy:
$V(\hat{p}) = V(X/n) = n^{-2} V(X)$, and so

$$V(\hat{p}) = \frac{\Sigma}{n}$$

where

$$\Sigma = \begin{pmatrix} p_1 (1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\ -p_1 p_2 & p_2 (1 - p_2) & \cdots & -p_2 p_k \\ \vdots & \vdots & & \vdots \\ -p_1 p_k & -p_2 p_k & \cdots & p_k (1 - p_k) \end{pmatrix}.$$

For large n, $\hat{p}$ has approximately a multivariate Normal distribution.

14.6 Theorem. As $n \to \infty$,

$$\sqrt{n} (\hat{p} - p) \rightsquigarrow N(0, \Sigma).$$

14.5  Bibliographic  Remarks 

Some  references  on  multivariate  analysis  are  Johnson  and  Wichern  (1982)  and 
Anderson  (1984).  The  method  for  constructing  the  confidence  interval  for  the 
correlation  described  in  this  chapter  is  due  to  Fisher  (1921). 

14.6  Appendix 

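Proof of Theorem 14.3. Denote the ith random vector by $X^i$. The
log-likelihood is

$$\ell(\mu, \Sigma) = \sum_{i=1}^{n} \log f(X^i; \mu, \Sigma) = -\frac{kn}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (X^i - \mu)^T \Sigma^{-1} (X^i - \mu).$$

Now,

$$\sum_{i=1}^{n} (X^i - \mu)^T \Sigma^{-1} (X^i - \mu) = \sum_{i=1}^{n} \bigl[ (X^i - \bar{X}) + (\bar{X} - \mu) \bigr]^T \Sigma^{-1} \bigl[ (X^i - \bar{X}) + (\bar{X} - \mu) \bigr]$$

$$= \sum_{i=1}^{n} \bigl[ (X^i - \bar{X})^T \Sigma^{-1} (X^i - \bar{X}) \bigr] + n (\bar{X} - \mu)^T \Sigma^{-1} (\bar{X} - \mu)$$

since $\sum_{i=1}^{n} (X^i - \bar{X})^T \Sigma^{-1} (\bar{X} - \mu) = 0$. Also, notice that $(X^i - \bar{X})^T \Sigma^{-1} (X^i - \bar{X})$
is a scalar, so

$$\sum_{i=1}^{n} (X^i - \bar{X})^T \Sigma^{-1} (X^i - \bar{X}) = \sum_{i=1}^{n} \mathrm{tr}\bigl[ (X^i - \bar{X})^T \Sigma^{-1} (X^i - \bar{X}) \bigr]$$

$$= \sum_{i=1}^{n} \mathrm{tr}\bigl[ \Sigma^{-1} (X^i - \bar{X})(X^i - \bar{X})^T \bigr]$$

$$= \mathrm{tr}\Bigl[ \Sigma^{-1} \sum_{i=1}^{n} (X^i - \bar{X})(X^i - \bar{X})^T \Bigr]$$

$$= (n - 1)\, \mathrm{tr}\bigl[ \Sigma^{-1} S \bigr]$$

and the conclusion follows. ■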

14.7  Exercises 

1.  Prove  Theorem  14.1. 

2.  Find  the  Fisher  information  matrix  for  the  mle  of  a  Multinomial. 

3. (Computer Experiment.) Write a function to generate nsim observations
from a Multinomial(n, p) distribution.

4. (Computer Experiment.) Write a function to generate nsim observations
from a Multivariate normal with given mean $\mu$ and covariance matrix
$\Sigma$.

5. (Computer Experiment.) Generate 100 random vectors from a $N(\mu, \Sigma)$
distribution where

Ms)-  s=0  O' 

Plot the simulation as a scatterplot. Estimate the mean and covariance
matrix $\Sigma$. Find the correlation $\rho$ between $X_1$ and $X_2$. Compare this
with  the  sample  correlations  from  your  simulation.  Find  a  95  percent 
confidence  interval  for  p.  Use  two  methods:  the  bootstrap  and  Fisher’s 
method.  Compare. 

6. (Computer Experiment.) Repeat the previous exercise 1000 times. Compare
the coverage of the two confidence intervals for $\rho$.


In this chapter we address the following questions:

(1)  How  do  we  test  if  two  random  variables  are  independent? 

(2)  How  do  we  estimate  the  strength  of  dependence  between  two 
random  variables? 

When  Y  and  Z  are  not  independent,  we  say  that  they  are  dependent  or 
associated  or  related.  If  Y  and  Z  are  associated,  it  does  not  imply  that  Y 
causes  Z  or  that  Z  causes  Y .  Causation  is  discussed  in  Chapter  16. 

Recall that we write $Y \perp\!\!\!\perp Z$ to mean that Y and Z are independent and we
write $Y \not\perp\!\!\!\perp Z$ to mean that Y and Z are dependent.


15.1  Two  Binary  Variables 

Suppose that Y and Z are both binary and consider data $(Y_1, Z_1), \ldots, (Y_n, Z_n)$.
We can represent the data as a two-by-two table:

          Y = 0     Y = 1
Z = 0     X_00      X_01      X_0.
Z = 1     X_10      X_11      X_1.
          X_.0      X_.1      n = X_..

where

$X_{ij}$ = number of observations for which Z = i and Y = j.

The dotted subscripts denote sums. Thus,

$$X_{i\cdot} = \sum_j X_{ij}, \quad X_{\cdot j} = \sum_i X_{ij}, \quad n = X_{\cdot\cdot} = \sum_{i,j} X_{ij}.$$

This is a convention we use throughout the remainder of the book. Denote
the corresponding probabilities by:

          Y = 0     Y = 1
Z = 0     p_00      p_01
Z = 1     p_10      p_11

where $p_{ij} = P(Z = i, Y = j)$. Let $X = (X_{00}, X_{01}, X_{10}, X_{11})$ denote the vector
of counts. Then $X \sim \mathrm{Multinomial}(n, p)$ where $p = (p_{00}, p_{01}, p_{10}, p_{11})$. It is now
convenient to introduce two new parameters.


15.1 Definition. The odds ratio is defined to be

$$\psi = \frac{p_{00}\, p_{11}}{p_{01}\, p_{10}}. \qquad (15.1)$$

The log odds ratio is defined to be

$$\gamma = \log(\psi). \qquad (15.2)$$


15.2 Theorem. The following statements are equivalent:

1. $Y \perp\!\!\!\perp Z$.

2. $\psi = 1$.

3. $\gamma = 0$.

4. For $i, j \in \{0, 1\}$, $p_{ij} = p_{i\cdot}\, p_{\cdot j}$.

Now consider testing

$$H_0 : Y \perp\!\!\!\perp Z \quad \text{versus} \quad H_1 : Y \not\perp\!\!\!\perp Z. \qquad (15.3)$$

First we consider the likelihood ratio test. Under $H_1$, $X \sim \mathrm{Multinomial}(n, p)$
and the mle is the vector $\hat{p} = X/n$. Under $H_0$, we again have that
$X \sim \mathrm{Multinomial}(n, p)$ but the restricted mle is computed under the constraint
$p_{ij} = p_{i\cdot}\, p_{\cdot j}$. This leads to the following test:

15.3 Theorem. The likelihood ratio test statistic for (15.3) is

$$T = 2 \sum_{i=0}^{1} \sum_{j=0}^{1} X_{ij} \log\left( \frac{X_{ij}\, X_{\cdot\cdot}}{X_{i\cdot}\, X_{\cdot j}} \right). \qquad (15.4)$$

Under $H_0$, $T \rightsquigarrow \chi^2_1$. Thus, an approximate level $\alpha$ test is obtained by
rejecting $H_0$ when $T > \chi^2_{1,\alpha}$.


Another popular test for independence is Pearson's $\chi^2$ test.


15.4 Theorem. Pearson's $\chi^2$ test statistic for independence is

$$U = \sum_{i=0}^{1} \sum_{j=0}^{1} \frac{(X_{ij} - E_{ij})^2}{E_{ij}} \qquad (15.5)$$

where

$$E_{ij} = \frac{X_{i\cdot}\, X_{\cdot j}}{n}.$$

Under $H_0$, $U \rightsquigarrow \chi^2_1$. Thus, an approximate level $\alpha$ test is obtained by
rejecting $H_0$ when $U > \chi^2_{1,\alpha}$.


Here is the intuition for the Pearson test. Under $H_0$, $p_{ij} = p_{i\cdot}\, p_{\cdot j}$, so the
maximum likelihood estimator of $p_{ij}$ under $H_0$ is

$$\hat{p}_{ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n} \frac{X_{\cdot j}}{n}.$$

Thus, the expected number of observations in the (i, j) cell is

$$E_{ij} = n \hat{p}_{ij} = \frac{X_{i\cdot}\, X_{\cdot j}}{n}.$$

The statistic U compares the observed and expected counts.


15.5 Example. The following data from Johnson and Johnson (1972) relate
tonsillectomy and Hodgkins disease (footnote 1).

                    Hodgkins Disease    No Disease    Total
Tonsillectomy              90               165         255
No Tonsillectomy           84               307         391
Total                     174               472         646

1 The data are actually from a case-control study; see the appendix for an explanation of
case-control studies.

We would like to know if tonsillectomy is related to Hodgkins disease. The
likelihood ratio statistic is T = 14.75 and the p-value is $P(\chi^2_1 > 14.75) = .0001$.
The $\chi^2$ statistic is U = 14.96 and the p-value is $P(\chi^2_1 > 14.96) = .0001$. We reject
the null hypothesis of independence and conclude that tonsillectomy is associated
with Hodgkins disease. This does not mean that tonsillectomies cause
Hodgkins  disease.  Suppose,  for  example,  that  doctors  gave  tonsillectomies  to 
the  most  seriously  ill  patients.  Then  the  association  between  tonsillectomies 
and  Hodgkins  disease  may  be  due  to  the  fact  that  those  with  tonsillectomies 
were  the  most  ill  patients  and  hence  more  likely  to  have  a  serious  disease.  ■ 
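
The two statistics in this example can be reproduced with a few lines of code. The sketch below uses only the Python standard library and assumes all cell counts are positive; the function name is hypothetical. For one degree of freedom the tail probability is obtained from the identity $P(\chi^2_1 > t) = P(|Z| > \sqrt{t})$.

```python
import math

def independence_tests(table):
    """table[i][j] holds the count X_ij of a two-by-two table with positive entries."""
    row = [sum(r) for r in table]              # row totals X_i.
    col = [sum(c) for c in zip(*table)]        # column totals X_.j
    n = sum(row)
    T = U = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n     # E_ij = X_i. X_.j / n
            observed = table[i][j]
            T += 2 * observed * math.log(observed / expected)   # likelihood ratio (15.4)
            U += (observed - expected) ** 2 / expected          # Pearson (15.5)
    # P(chi2_1 > t) = P(|Z| > sqrt(t)) = erfc(sqrt(t / 2))
    pval = lambda t: math.erfc(math.sqrt(t / 2))
    return T, pval(T), U, pval(U)

# Tonsillectomy data from Example 15.5.
T, pT, U, pU = independence_tests([[90, 165], [84, 307]])
print(round(T, 2), round(U, 2), pT, pU)
```

Both p-values come out close to .0001, in line with the values reported in the example.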


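We can also estimate the strength of dependence by estimating the odds
ratio $\psi$ and the log-odds ratio $\gamma$.


15.6 Theorem. The mle's of $\psi$ and $\gamma$ are

$$\hat\psi = \frac{X_{00}\, X_{11}}{X_{01}\, X_{10}}, \qquad \hat\gamma = \log \hat\psi. \qquad (15.6)$$

The asymptotic standard errors (computed using the delta method) are

$$\mathrm{se}(\hat\gamma) = \sqrt{\frac{1}{X_{00}} + \frac{1}{X_{01}} + \frac{1}{X_{10}} + \frac{1}{X_{11}}} \qquad (15.7)$$

$$\mathrm{se}(\hat\psi) = \hat\psi\, \mathrm{se}(\hat\gamma). \qquad (15.8)$$


15.7 Remark. For small sample sizes, $\hat\psi$ and $\hat\gamma$ can have a very large variance.
In this case, we often use the modified estimator

$$\hat\psi = \frac{(X_{00} + \frac{1}{2})(X_{11} + \frac{1}{2})}{(X_{01} + \frac{1}{2})(X_{10} + \frac{1}{2})}. \qquad (15.9)$$


Another test for independence is the Wald test for $\gamma = 0$ given by
$W = (\hat\gamma - 0)/\mathrm{se}(\hat\gamma)$. A $1 - \alpha$ confidence interval for $\gamma$ is $\hat\gamma \pm z_{\alpha/2}\, \mathrm{se}(\hat\gamma)$.

A $1 - \alpha$ confidence interval for $\psi$ can be obtained in two ways. First, we
could use $\hat\psi \pm z_{\alpha/2}\, \mathrm{se}(\hat\psi)$. Second, since $\psi = e^{\gamma}$ we could use

$$\exp\{ \hat\gamma \pm z_{\alpha/2}\, \mathrm{se}(\hat\gamma) \}. \qquad (15.10)$$

This second method is usually more accurate.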


15.8 Example. In the previous example,

$$\hat\psi = \frac{90 \times 307}{165 \times 84} = 1.99$$

and

$$\hat\gamma = \log(1.99) = .69.$$

So tonsillectomy patients were twice as likely to have Hodgkins disease. The
standard error of $\hat\gamma$ is

$$\sqrt{\frac{1}{90} + \frac{1}{84} + \frac{1}{165} + \frac{1}{307}} = .18.$$

The Wald statistic is $W = .69/.18 = 3.84$ whose p-value is $P(|Z| > 3.84) = .0001$,
the same as the other tests. A 95 per cent confidence interval for $\gamma$ is
$\hat\gamma \pm 2(.18) = (.33, 1.05)$. A 95 per cent confidence interval for $\psi$ is
$(e^{.33}, e^{1.05}) = (1.39, 2.86)$. ■
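The arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch using the formulas of Theorem 15.6; the variable names are illustrative. It uses $z_{.025} = 1.96$ and unrounded intermediate values, so the interval endpoints differ slightly from the hand calculation above, which rounds and uses 2 in place of 1.96.

```python
import math

# Cell counts from Example 15.5, labeled so that psi_hat = X00*X11 / (X01*X10).
x00, x01 = 90, 165
x10, x11 = 84, 307

psi_hat = (x00 * x11) / (x01 * x10)            # estimated odds ratio
gamma_hat = math.log(psi_hat)                  # estimated log-odds ratio
se_gamma = math.sqrt(1 / x00 + 1 / x01 + 1 / x10 + 1 / x11)
se_psi = psi_hat * se_gamma                    # delta-method standard error

wald = gamma_hat / se_gamma                    # Wald statistic for gamma = 0
p_value = math.erfc(abs(wald) / math.sqrt(2))  # P(|Z| > |W|) for standard Normal Z

z = 1.96                                       # z_{alpha/2} for alpha = 0.05
ci_gamma = (gamma_hat - z * se_gamma, gamma_hat + z * se_gamma)
ci_psi = (math.exp(ci_gamma[0]), math.exp(ci_gamma[1]))  # interval (15.10)

print(round(psi_hat, 2), round(gamma_hat, 2), round(se_gamma, 2))
print(round(wald, 2), p_value)
print(ci_gamma, ci_psi)
```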


15.2  Two  Discrete  Variables 


Now suppose that $Z \in \{1, \ldots, I\}$ and $Y \in \{1, \ldots, J\}$ are two discrete
variables. The data can be represented as an $I \times J$ table of counts:

          Y = 1    Y = 2    ...    Y = j    ...    Y = J
Z = 1     X_11     X_12     ...    X_1j     ...    X_1J     X_1.
 ...       ...      ...     ...     ...     ...     ...      ...
Z = i     X_i1     X_i2     ...    X_ij     ...    X_iJ     X_i.
 ...       ...      ...     ...     ...     ...     ...      ...
Z = I     X_I1     X_I2     ...    X_Ij     ...    X_IJ     X_I.
          X_.1     X_.2     ...    X_.j     ...    X_.J      n

where

$X_{ij}$ = number of observations for which $Z = i$ and $Y = j$.

Consider testing

$$H_0 : Y \amalg Z \quad \text{versus} \quad H_1 : Y \not\amalg Z. \tag{15.11}$$


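15.9 Theorem. The likelihood ratio test statistic for (15.11) is

$$T = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\left(\frac{X_{ij}\, X_{\cdot\cdot}}{X_{i\cdot}\, X_{\cdot j}}\right). \tag{15.12}$$

The limiting distribution of $T$ under the null hypothesis of independence
is $\chi^2_\nu$ where $\nu = (I - 1)(J - 1)$. Pearson's $\chi^2$ test statistic is

$$U = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(X_{ij} - E_{ij})^2}{E_{ij}}, \tag{15.13}$$

where $E_{ij} = X_{i\cdot} X_{\cdot j}/n$ is the estimated expected count in cell $(i,j)$
under $H_0$. Asymptotically, under $H_0$, $U$ has a $\chi^2_\nu$ distribution where
$\nu = (I - 1)(J - 1)$.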


15.10 Example. These data are from Dunsmore et al. (1987). Patients with
Hodgkins disease are classified by their response to treatment and by
histological type.

Type   Positive Response   Partial Response   No Response   Total
LP            74                  18               12         104
NS            68                  16               12          96
MC           154                  54               58         266
LD            18                  10               44          72

The $\chi^2$ test statistic is 75.89 with $(4 - 1)(3 - 1) = 6$ degrees of freedom. The
p-value is $P(\chi^2_6 > 75.89) \approx 0$. The likelihood ratio test statistic is 68.30,
also with 6 degrees of freedom. The p-value is $P(\chi^2_6 > 68.30) \approx 0$. Thus there
is strong evidence that response to treatment and histological type are
associated. ■
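The same computation extends to any $I \times J$ table. Below is a minimal sketch of (15.12) and (15.13), again assuming $E_{ij} = X_{i\cdot}X_{\cdot j}/n$. The helper `chi2_sf_even_df` is an illustrative name; it uses the closed-form tail probability that holds only when the degrees of freedom are even, which happens to cover this example.

```python
import math

def independence_stats(x):
    """Return (T, U, df) for an I x J table of counts given as a list of rows."""
    n = sum(sum(row) for row in x)
    row_tot = [sum(row) for row in x]
    col_tot = [sum(col) for col in zip(*x)]
    t_stat = 0.0
    u_stat = 0.0
    for i, row in enumerate(x):
        for j, count in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if count > 0:  # a zero cell contributes 0 to the likelihood ratio sum
                t_stat += 2 * count * math.log(count / expected)
            u_stat += (count - expected) ** 2 / expected
    df = (len(x) - 1) * (len(x[0]) - 1)
    return t_stat, u_stat, df

def chi2_sf_even_df(stat, df):
    """P(chi^2_df > stat) for even df: exp(-s/2) * sum_{j<df/2} (s/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires positive even df")
    half = stat / 2
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Counts from Example 15.10 (rows LP, NS, MC, LD; columns positive, partial, none).
hodgkins = [[74, 18, 12],
            [68, 16, 12],
            [154, 54, 58],
            [18, 10, 44]]

T, U, df = independence_stats(hodgkins)
print(round(T, 2), round(U, 2), df)
print(chi2_sf_even_df(T, df), chi2_sf_even_df(U, df))
```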


15.3  Two  Continuous  Variables 

Now suppose that $Y$ and $Z$ are both continuous. If we assume that the joint
distribution of $Y$ and $Z$ is bivariate Normal, then we measure the dependence
between $Y$ and $Z$ by means of the correlation coefficient $\rho$. Tests, estimates,
and confidence intervals for $\rho$ in the Normal case are given in the previous
chapter in Section 14.2. If we do not assume Normality then we can still use the
methods in Section 14.2 to draw inferences about the correlation $\rho$. However,
if we conclude that $\rho$ is 0, we cannot conclude that $Y$ and $Z$ are independent,
only that they are uncorrelated. Fortunately, the reverse direction is valid:
if we conclude that $Y$ and $Z$ are correlated then we can conclude they are
dependent.


15.4  One  Continuous  Variable  and  One  Discrete 

Suppose that $Y \in \{1, \ldots, I\}$ is discrete and $Z$ is continuous. Let
$F_i(z) = P(Z \le z \mid Y = i)$ denote the CDF of $Z$ conditional on $Y = i$.

15.11 Theorem. When $Y \in \{1, \ldots, I\}$ is discrete and $Z$ is continuous, then
$Y \amalg Z$ if and only if $F_1 = \cdots = F_I$.


It follows from the previous theorem that to test for independence, we need
to test

$$H_0 : F_1 = \cdots = F_I \quad \text{versus} \quad H_1 : \text{not } H_0.$$

For simplicity, we consider the case where $I = 2$. To test the null hypothesis
that $F_1 = F_2$ we will use the two sample Kolmogorov-Smirnov test. Let
$n_1$ denote the number of observations for which $Y_i = 1$ and let $n_2$ denote the
number of observations for which $Y_i = 2$. Let

$$\hat F_1(z) = \frac{1}{n_1} \sum_{i=1}^{n} I(Z_i \le z)\, I(Y_i = 1)$$

and

$$\hat F_2(z) = \frac{1}{n_2} \sum_{i=1}^{n} I(Z_i \le z)\, I(Y_i = 2)$$

denote the empirical distribution functions of $Z$ given $Y = 1$ and $Y = 2$
respectively. Define the test statistic

$$D = \sup_x \left|\hat F_1(x) - \hat F_2(x)\right|.$$


15.12 Theorem. Let

$$H(t) = 1 - 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 t^2}. \tag{15.14}$$

Under the null hypothesis that $F_1 = F_2$,

$$\lim_{n \to \infty} P\left(\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\; D \le t\right) = H(t).$$


It follows from the theorem that an approximate level $\alpha$ test is obtained by
rejecting $H_0$ when

$$\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\; D > H^{-1}(1 - \alpha).$$
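The two-sample test can be sketched in a few lines. This is an illustrative implementation, not a reference one: the function names are assumptions, the supremum in $D$ is evaluated only at the pooled data points (where it is attained, since both empirical CDFs are step functions), and the infinite series in (15.14) is truncated at 100 terms, which is adequate unless $t$ is very close to 0. The sample values at the bottom are made up for illustration.

```python
import bisect
import math

def ks_statistic(z1, z2):
    """D = sup_x |F1_hat(x) - F2_hat(x)| for two samples."""
    s1, s2 = sorted(z1), sorted(z2)
    d = 0.0
    for point in s1 + s2:
        f1 = bisect.bisect_right(s1, point) / len(s1)  # fraction of s1 <= point
        f2 = bisect.bisect_right(s2, point) / len(s2)  # fraction of s2 <= point
        d = max(d, abs(f1 - f2))
    return d

def ks_limit_cdf(t, terms=100):
    """H(t) from (15.14), with the series truncated after `terms` terms."""
    if t <= 0:
        return 0.0
    total = sum((-1) ** (j - 1) * math.exp(-2 * j * j * t * t)
                for j in range(1, terms + 1))
    return 1 - 2 * total

def ks_test(z1, z2):
    """Return (D, approximate p-value) using the limiting distribution H."""
    n1, n2 = len(z1), len(z2)
    d = ks_statistic(z1, z2)
    scaled = math.sqrt(n1 * n2 / (n1 + n2)) * d
    p_value = min(1.0, max(0.0, 1 - ks_limit_cdf(scaled)))
    return d, p_value

# Made-up samples: Z values for the group Y = 1 and for the group Y = 2.
group1 = [0.1, 0.4, 0.7, 1.2, 1.5, 1.9, 2.3, 2.8]
group2 = [1.1, 1.8, 2.5, 2.9, 3.3, 3.6, 4.0, 4.4]
print(ks_test(group1, group2))
```

Rejecting when the scaled statistic exceeds $H^{-1}(1-\alpha)$ is the same as rejecting when the returned p-value is below $\alpha$.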


15.5  Appendix 

Interpreting The Odds Ratios. Suppose event $A$ has probability $P(A)$.
The odds of $A$ are defined as $\mathrm{odds}(A) = P(A)/(1 - P(A))$. It follows that
$P(A) = \mathrm{odds}(A)/(1 + \mathrm{odds}(A))$. Let $E$ be the event that someone is exposed
to something (smoking, radiation, etc.) and let $D$ be the event that they get
a disease. The odds of getting the disease given that you are exposed are:

$$\mathrm{odds}(D|E) = \frac{P(D|E)}{1 - P(D|E)}$$

and the odds of getting the disease given that you are not exposed are:

$$\mathrm{odds}(D|E^c) = \frac{P(D|E^c)}{1 - P(D|E^c)}.$$

The odds ratio is defined to be

$$\psi = \frac{\mathrm{odds}(D|E)}{\mathrm{odds}(D|E^c)}.$$

If $\psi = 1$ then disease probability is the same for exposed and unexposed. This
implies that these events are independent. Recall that the log-odds ratio is
defined as $\gamma = \log(\psi)$. Independence corresponds to $\gamma = 0$.

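Consider this table of probabilities and corresponding table of data:

        D^c     D                      D^c     D
E^c     p_00    p_01    p_0.    E^c    X_00    X_01    X_0.
E       p_10    p_11    p_1.    E      X_10    X_11    X_1.
        p_.0    p_.1    1              X_.0    X_.1    X_..

Now

$$P(D|E) = \frac{p_{11}}{p_{10} + p_{11}} \quad \text{and} \quad P(D|E^c) = \frac{p_{01}}{p_{00} + p_{01}},$$

and so

$$\mathrm{odds}(D|E) = \frac{p_{11}}{p_{10}} \quad \text{and} \quad \mathrm{odds}(D|E^c) = \frac{p_{01}}{p_{00}},$$

and therefore,

$$\psi = \frac{p_{11}\, p_{00}}{p_{01}\, p_{10}}.$$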


To estimate the parameters, we have to first consider how the data were
collected. There are three methods.

Multinomial Sampling. We draw a sample from the population and,
for each person, record their exposure and disease status. In this case,
$X = (X_{00}, X_{01}, X_{10}, X_{11}) \sim \mathrm{Multinomial}(n, p)$. We then estimate the
probabilities in the table by $\hat p_{ij} = X_{ij}/n$ and

$$\hat\psi = \frac{\hat p_{11}\, \hat p_{00}}{\hat p_{01}\, \hat p_{10}} = \frac{X_{11} X_{00}}{X_{01} X_{10}}.$$

Prospective Sampling. (Cohort Sampling). We get some exposed and
unexposed people and count the number with disease in each group. Thus,

$$X_{01} \sim \mathrm{Binomial}(X_{0\cdot}, P(D|E^c))$$
$$X_{11} \sim \mathrm{Binomial}(X_{1\cdot}, P(D|E)).$$

We should really write $x_{0\cdot}$ and $x_{1\cdot}$ instead of $X_{0\cdot}$ and $X_{1\cdot}$ since in this case,
these are fixed not random, but for notational simplicity I'll keep using capital
letters. We can estimate $P(D|E)$ and $P(D|E^c)$ but we cannot estimate all the
probabilities in the table. Still, we can estimate $\psi$ since $\psi$ is a function of
$P(D|E)$ and $P(D|E^c)$. Now

$$\hat P(D|E) = \frac{X_{11}}{X_{1\cdot}} \quad \text{and} \quad \hat P(D|E^c) = \frac{X_{01}}{X_{0\cdot}}.$$

Thus,

$$\hat\psi = \frac{X_{11} X_{00}}{X_{01} X_{10}}$$

just as before.

Case-Control (Retrospective) Sampling. Here we get some diseased
and non-diseased people and we observe how many are exposed. This is much
more efficient if the disease is rare. Hence,

$$X_{10} \sim \mathrm{Binomial}(X_{\cdot 0}, P(E|D^c))$$
$$X_{11} \sim \mathrm{Binomial}(X_{\cdot 1}, P(E|D)).$$

From these data we can estimate $P(E|D)$ and $P(E|D^c)$. Surprisingly, we can
also still estimate $\psi$. To understand why, note that

$$P(E|D) = \frac{p_{11}}{p_{01} + p_{11}}, \quad 1 - P(E|D) = \frac{p_{01}}{p_{01} + p_{11}}, \quad \mathrm{odds}(E|D) = \frac{p_{11}}{p_{01}}.$$

By a similar argument,

$$\mathrm{odds}(E|D^c) = \frac{p_{10}}{p_{00}}.$$

Hence,

$$\frac{\mathrm{odds}(E|D)}{\mathrm{odds}(E|D^c)} = \frac{p_{11}\, p_{00}}{p_{01}\, p_{10}} = \psi.$$

From the data, we form the following estimates:

$$\hat P(E|D) = \frac{X_{11}}{X_{\cdot 1}}, \quad 1 - \hat P(E|D) = \frac{X_{01}}{X_{\cdot 1}}, \quad \widehat{\mathrm{odds}}(E|D) = \frac{X_{11}}{X_{01}}, \quad \widehat{\mathrm{odds}}(E|D^c) = \frac{X_{10}}{X_{00}}.$$

Therefore,

$$\hat\psi = \frac{X_{00} X_{11}}{X_{01} X_{10}}.$$

So in all three data collection methods, the estimate of $\psi$ turns out to be the
same.
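The key fact behind the case-control result is that the odds ratio is symmetric in the roles of exposure and disease. The sketch below checks this numerically for one hypothetical table of cell probabilities (the numbers are made up for illustration and do not come from any data set in the text).

```python
import math

# Hypothetical cell probabilities p[(e, d)]: e = 1 if exposed, d = 1 if diseased.
p = {(0, 0): 0.60, (0, 1): 0.05,
     (1, 0): 0.25, (1, 1): 0.10}

def odds(prob):
    """odds(A) = P(A) / (1 - P(A))."""
    return prob / (1 - prob)

# Prospective view: condition on exposure.
p_d_given_e = p[1, 1] / (p[1, 0] + p[1, 1])
p_d_given_ec = p[0, 1] / (p[0, 0] + p[0, 1])
psi_prospective = odds(p_d_given_e) / odds(p_d_given_ec)

# Retrospective (case-control) view: condition on disease status.
p_e_given_d = p[1, 1] / (p[0, 1] + p[1, 1])
p_e_given_dc = p[1, 0] / (p[0, 0] + p[1, 0])
psi_retrospective = odds(p_e_given_d) / odds(p_e_given_dc)

# Cross-product formula.
psi_direct = p[1, 1] * p[0, 0] / (p[0, 1] * p[1, 0])

print(psi_prospective, psi_retrospective, psi_direct)
```

All three quantities agree (up to floating-point rounding), which is why conditioning on disease status rather than exposure still identifies $\psi$.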

It is tempting to try to estimate $P(D|E) - P(D|E^c)$. In a case-control design,
this quantity is not estimable. To see this, we apply Bayes' theorem to get

$$P(D|E) - P(D|E^c) = \frac{P(E|D)\,P(D)}{P(E)} - \frac{P(E^c|D)\,P(D)}{P(E^c)}.$$

Because of the way we obtained the data, $P(D)$ is not estimable from the data.
However, we can estimate $\xi = P(D|E)/P(D|E^c)$, which is called the relative
risk, under the rare disease assumption.


15.13 Theorem. Let $\xi = P(D|E)/P(D|E^c)$. Then

$$\frac{\psi}{\xi} \to 1$$

as $P(D) \to 0$.


Thus, under the rare disease assumption, the relative risk is approximately
the same as the odds ratio and, as we have seen, we can estimate the odds
ratio.
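A short derivation makes the rare disease approximation concrete. The following is a sketch rather than a full proof; it assumes that $P(E)$ and $P(E^c)$ stay bounded away from 0 as $P(D) \to 0$, an assumption that the text does not state explicitly.

```latex
\begin{align*}
\psi
  &= \frac{P(D|E)\,/\,\bigl(1 - P(D|E)\bigr)}{P(D|E^c)\,/\,\bigl(1 - P(D|E^c)\bigr)}
   = \frac{P(D|E)}{P(D|E^c)} \cdot \frac{1 - P(D|E^c)}{1 - P(D|E)}
   = \xi \cdot \frac{1 - P(D|E^c)}{1 - P(D|E)} . \\[4pt]
\text{Since } & P(D|E) \le \frac{P(D)}{P(E)}
  \ \text{ and } \
  P(D|E^c) \le \frac{P(D)}{P(E^c)},
  \ \text{ both tend to } 0 \text{ as } P(D) \to 0, \\[4pt]
\text{so } \ \frac{\psi}{\xi}
  &= \frac{1 - P(D|E^c)}{1 - P(D|E)} \longrightarrow 1 .
\end{align*}
```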


15.6  Exercises 

1.  Prove  Theorem  15.2. 

2.  Prove  Theorem  15.3. 

3.  Prove  Theorem  15.6. 

4. The New York Times (January 8, 2003, page A12) reported the following
data on death sentencing and race, from a study in Maryland:²

                 Death Sentence   No Death Sentence
Black Victim           14                641
White Victim           62                594

Analyze the data using the tools from this chapter. Interpret the results.
Explain why, based only on this information, you can't make causal
conclusions. (The authors of the study did use much more information
in their full report.)

² The data here are an approximate re-creation using the information in the article.

5. Analyze the data on the variables Age and Financial Status from:

http://lib.stat.cmu.edu/DASL/Datafiles/montanadat.html

6. Estimate the correlation between temperature and latitude using the
data from

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

Use the correlation coefficient. Provide estimates, tests, and confidence
intervals.

7. Test whether calcium intake and drop in blood pressure are associated.
Use the data in

http://lib.stat.cmu.edu/DASL/Datafiles/Calcium.html


16 

Causal  Inference 


Roughly speaking, the statement “X causes Y” means that changing the
value of X will change the distribution of Y. When X causes Y, X and Y
will be associated but the reverse is not, in general, true. Association does not
necessarily imply causation. We will consider two frameworks for discussing
causation. The first uses counterfactual random variables. The second,
presented in the next chapter, uses directed acyclic graphs.


16.1  The  Counterfactual  Model 

Suppose  that  X  is  a  binary  treatment  variable  where  X  =  1  means  “treated” 
and  X  =  0  means  “not  treated.”  We  are  using  the  word  “treatment”  in  a 
very  broad  sense.  Treatment  might  refer  to  a  medication  or  something  like 
smoking.  An  alternative  to  “treated/not  treated”  is  “exposed/not  exposed” 
but  we  shall  use  the  former. 

Let Y be some outcome variable such as presence or absence of disease.
To distinguish the statement “X is associated with Y” from the statement “X
causes Y” we need to enrich our probabilistic vocabulary. Specifically, we will
decompose the response Y into a more fine-grained object.

We introduce two new random variables $(C_0, C_1)$, called potential
outcomes, with the following interpretation: $C_0$ is the outcome if the subject is
not treated ($X = 0$) and $C_1$ is the outcome if the subject is treated ($X = 1$).
Hence,

$$Y = \begin{cases} C_0 & \text{if } X = 0 \\ C_1 & \text{if } X = 1. \end{cases}$$

We can express the relationship between $Y$ and $(C_0, C_1)$ more succinctly by

$$Y = C_X. \tag{16.1}$$

Equation (16.1) is called the consistency relationship.

Here  is  a  toy  dataset  to  make  the  idea  clear: 


X   Y   C0   C1
0   4   4    *
0   7   7    *
0   2   2    *
0   8   8    *
1   3   *    3
1   5   *    5
1   8   *    8
1   9   *    9

The asterisks denote unobserved values. When $X = 0$ we don't observe $C_1$,
in which case we say that $C_1$ is a counterfactual since it is the outcome
you would have had if, counter to the fact, you had been treated ($X = 1$).
Similarly, when $X = 1$ we don't observe $C_0$, and we say that $C_0$ is
counterfactual. When the potential outcomes are binary, there are four types
of subjects:

Type              C0   C1
Survivors         1    1
Responders        0    1
Anti-responders   1    0
Doomed            0    0


Think of the potential outcomes $(C_0, C_1)$ as hidden variables that contain all
the relevant information about the subject.

Define the average causal effect or average treatment effect to be

$$\theta = E(C_1) - E(C_0). \tag{16.2}$$

The parameter $\theta$ has the following interpretation: $\theta$ is the mean if everyone
were treated ($X = 1$) minus the mean if everyone were not treated ($X = 0$).
There are other ways of measuring the causal effect. For example, if $C_0$ and
$C_1$ are binary, we define the causal odds ratio

$$\frac{P(C_1 = 1)}{P(C_1 = 0)} \div \frac{P(C_0 = 1)}{P(C_0 = 0)}$$

and the causal relative risk

$$\frac{P(C_1 = 1)}{P(C_0 = 1)}.$$

The main ideas will be the same whatever causal effect we use. For simplicity,
we shall work with the average causal effect $\theta$.

Define the association to be

$$\alpha = E(Y|X = 1) - E(Y|X = 0). \tag{16.3}$$

Again, we could use odds ratios or other summaries if we wish.

16.1 Theorem (Association Is Not Causation). In general, $\theta \ne \alpha$.

16.2 Example. Suppose the whole population is as follows:

X   Y   C0   C1
0   0   0    0*
0   0   0    0*
0   0   0    0*
0   0   0    0*
1   1   1*   1
1   1   1*   1
1   1   1*   1
1   1   1*   1

Again, the asterisks denote unobserved values. Notice that $C_0 = C_1$ for every
subject, thus, this treatment has no effect. Indeed,

$$\theta = E(C_1) - E(C_0) = \frac{1}{8}\sum_{i=1}^{8} C_{1i} - \frac{1}{8}\sum_{i=1}^{8} C_{0i}$$
$$= \frac{0+0+0+0+1+1+1+1}{8} - \frac{0+0+0+0+1+1+1+1}{8} = 0.$$

Thus, the average causal effect is 0. The observed data are only the X's and
Y's, from which we can estimate the association:

$$\alpha = E(Y|X = 1) - E(Y|X = 0) = \frac{1+1+1+1}{4} - \frac{0+0+0+0}{4} = 1.$$

Hence, $\theta \ne \alpha$.

To add some intuition to this example, imagine that the outcome variable
is 1 if “healthy” and 0 if “sick”. Suppose that $X = 0$ means that the subject
does not take vitamin C and that $X = 1$ means that the subject does take
vitamin C. Vitamin C has no causal effect since $C_0 = C_1$ for each subject. In
this example there are two types of people: healthy people $(C_0, C_1) = (1, 1)$
and unhealthy people $(C_0, C_1) = (0, 0)$. Healthy people tend to take vitamin
C while unhealthy people don't. It is this association between $(C_0, C_1)$ and
$X$ that creates an association between $X$ and $Y$. If we only had data on $X$
and $Y$ we would conclude that $X$ and $Y$ are associated. Suppose we wrongly
interpret this causally and conclude that vitamin C prevents illness. Next we
might encourage everyone to take vitamin C. If most people comply with our
advice, the population will look something like this:

X   Y   C0   C1
0   0   0    0*
1   0   0*   0
1   0   0*   0
1   0   0*   0
1   1   1*   1
1   1   1*   1
1   1   1*   1
1   1   1*   1

Now $\alpha = (4/7) - (0/1) = 4/7$. We see that $\alpha$ went down from 1 to 4/7.
Of course, the causal effect never changed but the naive observer who does
not distinguish association and causation will be confused because his advice
seems to have made things worse instead of better. ■
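The numbers in this example are easy to verify by treating each population as a list of (X, C0, C1) triples. The sketch below is illustrative only; in real data one of the two potential outcomes is always unobserved, so `causal_effect` could not actually be computed from observations. Exact fractions are used so that 4/7 is reproduced without rounding.

```python
from fractions import Fraction

def causal_effect(pop):
    """theta = E(C1) - E(C0), averaging over the whole population."""
    n = len(pop)
    return Fraction(sum(c1 for _, _, c1 in pop), n) - Fraction(sum(c0 for _, c0, _ in pop), n)

def association(pop):
    """alpha = E(Y | X = 1) - E(Y | X = 0), where Y = C_X is the observed outcome."""
    y_treated = [c1 for x, _, c1 in pop if x == 1]    # Y = C1 when X = 1
    y_untreated = [c0 for x, c0, _ in pop if x == 0]  # Y = C0 when X = 0
    return (Fraction(sum(y_treated), len(y_treated))
            - Fraction(sum(y_untreated), len(y_untreated)))

# Original population of Example 16.2: (X, C0, C1) for each of 8 subjects.
before = [(0, 0, 0)] * 4 + [(1, 1, 1)] * 4

# Population after most people follow the advice to take vitamin C.
after = [(0, 0, 0)] + [(1, 0, 0)] * 3 + [(1, 1, 1)] * 4

print(causal_effect(before), association(before))
print(causal_effect(after), association(after))
```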

In the last example, $\theta = 0$ and $\alpha = 1$. It is not hard to create examples in
which $\alpha > 0$ and yet $\theta < 0$. The fact that the association and causal effects
can have different signs is very confusing to many people.

The example makes it clear that, in general, we cannot use the association
to estimate the causal effect $\theta$. The reason that $\theta \ne \alpha$ is that $(C_0, C_1)$ was
not independent of $X$. That is, treatment assignment was not independent of
person type.

Can we ever estimate the causal effect? The answer is: sometimes. In
particular, random assignment to treatment makes it possible to estimate $\theta$.

16.3 Theorem. Suppose we randomly assign subjects to treatment and that
$P(X = 0) > 0$ and $P(X = 1) > 0$. Then $\alpha = \theta$. Hence, any consistent
estimator of $\alpha$ is a consistent estimator of $\theta$. In particular,

$$\hat\theta = \hat E(Y|X = 1) - \hat E(Y|X = 0) = \overline{Y}_1 - \overline{Y}_0$$

is a consistent estimator of $\theta$, where

$$\overline{Y}_1 = \frac{1}{n_1}\sum_{i=1}^{n} Y_i X_i, \qquad \overline{Y}_0 = \frac{1}{n_0}\sum_{i=1}^{n} Y_i (1 - X_i),$$

$n_1 = \sum_{i=1}^{n} X_i$, and $n_0 = \sum_{i=1}^{n} (1 - X_i)$.

Proof. Since $X$ is randomly assigned, $X$ is independent of $(C_0, C_1)$. Hence,

$$\begin{aligned}
\theta &= E(C_1) - E(C_0) \\
&= E(C_1|X = 1) - E(C_0|X = 0) && \text{since } X \amalg (C_0, C_1) \\
&= E(Y|X = 1) - E(Y|X = 0) && \text{since } Y = C_X \\
&= \alpha.
\end{aligned}$$

The consistency follows from the law of large numbers. ■
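A small simulation illustrates the theorem. This is a sketch under made-up assumptions: the potential outcomes are generated so that treatment adds exactly 2 to everyone's outcome (so $\theta = 2$), and the "self-selected" scenario, in which subjects with high $C_0$ choose treatment, is a hypothetical mechanism introduced only for contrast.

```python
import random

random.seed(0)
n = 10_000

# Hypothetical potential outcomes: C1 = C0 + 2 for every subject, so theta = 2.
c0 = [random.gauss(0, 1) for _ in range(n)]
c1 = [c + 2.0 for c in c0]

def diff_in_means(x, y):
    """theta_hat = Ybar_1 - Ybar_0 as defined in Theorem 16.3."""
    n1 = sum(x)
    n0 = len(x) - n1
    ybar1 = sum(yi * xi for xi, yi in zip(x, y)) / n1
    ybar0 = sum(yi * (1 - xi) for xi, yi in zip(x, y)) / n0
    return ybar1 - ybar0

def observe(x):
    """Consistency relationship: Y = C_X."""
    return [c1[i] if x[i] == 1 else c0[i] for i in range(n)]

# Randomized assignment: X is independent of (C0, C1).
x_random = [random.randint(0, 1) for _ in range(n)]
est_random = diff_in_means(x_random, observe(x_random))

# Self-selection: subjects with C0 > 0 choose treatment, so X depends on C0.
x_self = [1 if c > 0 else 0 for c in c0]
est_self = diff_in_means(x_self, observe(x_self))

print(est_random, est_self)
```

Under random assignment the difference in means lands close to 2; under self-selection it is well above 2 even though no subject's causal effect differs from 2.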

If $Z$ is a covariate, we define the conditional causal effect by

$$\theta_z = E(C_1|Z = z) - E(C_0|Z = z).$$

For example, if $Z$ denotes gender with values $Z = 0$ (women) and $Z = 1$
(men), then $\theta_0$ is the causal effect among women and $\theta_1$ is the causal effect
among men. In a randomized experiment,
$\theta_z = E(Y|X = 1, Z = z) - E(Y|X = 0, Z = z)$ and we can estimate the
conditional causal effect using appropriate sample averages.

Summary of the Counterfactual Model

Random variables: $(C_0, C_1, X, Y)$.

Consistency relationship: $Y = C_X$.

Causal Effect: $\theta = E(C_1) - E(C_0)$.

Association: $\alpha = E(Y|X = 1) - E(Y|X = 0)$.

Random assignment $\Longrightarrow$ $(C_0, C_1) \amalg X$ $\Longrightarrow$ $\theta = \alpha$.


16.2  Beyond  Binary  Treatments 

Let us now generalize beyond the binary case. Suppose that $X \in \mathcal{X}$. For
example, $X$ could be the dose of a drug in which case $\mathcal{X} = \mathbb{R}$. The
counterfactual vector $(C_0, C_1)$ now becomes the counterfactual function $C(x)$ where
$C(x)$ is the outcome a subject would have if he received dose $x$. The observed
response is given by the consistency relation

$$Y = C(X). \tag{16.4}$$

See Figure 16.1. The causal regression function is

$$\theta(x) = E(C(x)). \tag{16.5}$$

The regression function, which measures association, is $r(x) = E(Y|X = x)$.

FIGURE 16.1. A counterfactual function $C(x)$. The outcome $Y$ is the value of the
curve $C(x)$ evaluated at the observed dose $X$.

16.4 Theorem. In general, $\theta(x) \ne r(x)$. However, when $X$ is randomly
assigned, $\theta(x) = r(x)$.

16.5 Example. An example in which $\theta(x)$ is constant but $r(x)$ is not constant
is shown in Figure 16.2. The figure shows the counterfactual functions for
four subjects. The dots represent their $X$ values $X_1, X_2, X_3, X_4$. Since $C_i(x)$
is constant over $x$ for all $i$, there is no causal effect and hence

$$\theta(x) = \frac{C_1(x) + C_2(x) + C_3(x) + C_4(x)}{4}$$

is constant. Changing the dose $x$ will not change anyone's outcome. The four
dots in the lower plot represent the observed data points $Y_1 = C_1(X_1)$,
$Y_2 = C_2(X_2)$, $Y_3 = C_3(X_3)$, $Y_4 = C_4(X_4)$. The dotted line represents the regression
$r(x) = E(Y|X = x)$. Although there is no causal effect, there is an association
since the regression curve $r(x)$ is not constant. ■


16.3  Observational  Studies  and  Confounding 


A  study  in  which  treatment  (or  exposure)  is  not  randomly  assigned  is  called  an 
observational  study.  In  these  studies,  subjects  select  their  own  value  of  the 
exposure  X.  Many  of  the  health  studies  you  read  about  in  the  newspaper  are 
like this. As we saw, association and causation could in general be quite
different. This discrepancy occurs in non-randomized studies because the potential
outcome $C$ is not independent of treatment $X$. However, suppose we could
find groupings of subjects such that, within groups, $X$ and $\{C(x) : x \in \mathcal{X}\}$
are independent. This would happen if the subjects are very similar within
groups. For example, suppose we find people who are very similar in age,
gender, educational background, and ethnic background. Among these people we
might feel it is reasonable to assume that the choice of $X$ is essentially
random. These other variables are called confounding variables.¹ If we denote
these other variables collectively as $Z$, then we can express this idea by saying
that

$$\{C(x) : x \in \mathcal{X}\} \amalg X \mid Z. \tag{16.6}$$

Equation (16.6) means that, within groups of $Z$, the choice of treatment $X$
does not depend on type, as represented by $\{C(x) : x \in \mathcal{X}\}$. If (16.6) holds
and we observe $Z$ then we say that there is no unmeasured confounding.


16.6 Theorem. Suppose that (16.6) holds. Then,

$$\theta(x) = \int E(Y|X = x, Z = z)\, dF_Z(z). \tag{16.7}$$

If $\hat r(x, z)$ is a consistent estimate of the regression function
$E(Y|X = x, Z = z)$, then a consistent estimate of $\theta(x)$ is

$$\hat\theta(x) = \frac{1}{n}\sum_{i=1}^{n} \hat r(x, Z_i).$$

In particular, if $r(x, z) = \beta_0 + \beta_1 x + \beta_2 z$ is linear, then a consistent estimate
of $\theta(x)$ is

$$\hat\theta(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 \overline{Z}_n \tag{16.8}$$

where $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ are the least squares estimators.

¹ A more precise definition of confounding is given in the next chapter.

FIGURE 16.2. The top plot shows the counterfactual function $C(x)$ for four
subjects. The dots represent their $X$ values. Since $C_i(x)$ is constant over $x$ for all $i$, there
is no causal effect. Changing the dose will not change anyone's outcome. The lower
plot shows the causal regression function $\theta(x) = (C_1(x) + C_2(x) + C_3(x) + C_4(x))/4$.
The four dots represent the observed data points $Y_1 = C_1(X_1)$, $Y_2 = C_2(X_2)$,
$Y_3 = C_3(X_3)$, $Y_4 = C_4(X_4)$. The dotted line represents the regression
$r(x) = E(Y|X = x)$. There is no causal effect since $C_i(x)$ is constant for all $i$.
But there is an association since the regression curve $r(x)$ is not constant.
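The adjustment formula can be tried out in a simulation. The sketch below is built on made-up assumptions: $Z$ is a binary confounder, treatment is more likely when $Z = 1$, and the outcome depends on $Z$ but not on $X$, so the true $\theta(x)$ is the same for $x = 0$ and $x = 1$. Instead of the linear model (16.8), it uses the simplest possible $\hat r(x, z)$, namely the sample mean of $Y$ within each $(x, z)$ cell, and then averages over the observed $Z_i$ as in the theorem.

```python
import random

random.seed(1)
n = 20_000

# Simulated observational data (x, z, y); all mechanisms here are hypothetical.
data = []
for _ in range(n):
    z = 1 if random.random() < 0.5 else 0
    x = 1 if random.random() < (0.8 if z == 1 else 0.2) else 0  # X depends on Z
    y = 2.0 * z + random.gauss(0, 1)  # C(x) does not depend on x: no causal effect
    data.append((x, z, y))

def mean(values):
    return sum(values) / len(values)

# r_hat(x, z): sample mean of Y in each (x, z) cell, computed once.
r_hat = {(x, z): mean([yi for xi, zi, yi in data if xi == x and zi == z])
         for x in (0, 1) for z in (0, 1)}

def theta_hat(x):
    """(1/n) * sum_i r_hat(x, Z_i): the estimate from Theorem 16.6."""
    return mean([r_hat[(x, zi)] for _, zi, _ in data])

# Unadjusted association: E_hat(Y | X = 1) - E_hat(Y | X = 0).
naive = (mean([yi for xi, _, yi in data if xi == 1])
         - mean([yi for xi, _, yi in data if xi == 0]))

adjusted = theta_hat(1) - theta_hat(0)
print(naive, adjusted)
```

The unadjusted difference is far from 0 because $Z$ drives both treatment and outcome, while the adjusted difference is close to 0, matching the absence of any causal effect in the simulated mechanism.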

16.7 Remark. It is useful to compare equation (16.7) to $E(Y|X = x)$ which
can be written as $E(Y|X = x) = \int E(Y|X = x, Z = z)\, dF_{Z|X}(z|x)$.

Epidemiologists  call  (16.7)  the  adjusted  treatment  effect.  The  process  of 
computing  adjusted  treatment  effects  is  called  adjusting  (or  controlling) 
for  confounding.  The  selection  of  what  confounders  Z  to  measure  and  con¬ 
trol  for  requires  scientific  insight.  Even  after  adjusting  for  confounders,  we 
cannot  be  sure  that  there  are  not  other  confounding  variables  that  we  missed. 
This  is  why  observational  studies  must  be  treated  with  healthy  skepticism. 
Results  from  observational  studies  start  to  become  believable  when:  (i)  the 
results  are  replicated  in  many  studies,  (ii)  each  of  the  studies  controlled  for 
plausible  confounding  variables,  (iii)  there  is  a  plausible  scientific  explanation 
for  the  existence  of  a  causal  relationship. 

A  good  example  is  smoking  and  cancer.  Numerous  studies  have  shown  a 
relationship  between  smoking  and  cancer  even  after  adjusting  for  many  con¬ 
founding  variables.  Moreover,  in  laboratory  studies,  smoking  has  been  shown 
to  damage  lung  cells.  Finally,  a  causal  link  between  smoking  and  cancer  has 
been  found  in  randomized  animal  studies.  It  is  this  collection  of  evidence 
over  many  years  that  makes  this  a  convincing  case.  One  single  observational 
study  is  not,  by  itself,  strong  evidence.  Remember  that  when  you  read  the 
newspaper. 


16.4  Simpson’s  Paradox 

Simpson’s  paradox  is  a  puzzling  phenomenon  that  is  discussed  in  most  statis¬ 
tics  texts.  Unfortunately,  most  explanations  are  confusing  (and  in  some  cases 
incorrect).  The  reason  is  that  it  is  nearly  impossible  to  explain  the  paradox 
without  using  counterfactuals  (or  directed  acyclic  graphs). 

Let $X$ be a binary treatment variable, $Y$ a binary outcome, and $Z$ a third
binary variable such as gender. Suppose the joint distribution of $X$, $Y$, $Z$ is

              Z = 1 (men)           Z = 0 (women)
            Y = 1     Y = 0       Y = 1     Y = 0
X = 1       .1500     .2250       .1000     .0250
X = 0       .0375     .0875       .2625     .1125

The marginal distribution for $(X, Y)$ is

            Y = 1     Y = 0
X = 1        .25       .25       .50
X = 0        .30       .20       .50
             .55       .45        1

From these tables we find that

$$P(Y = 1|X = 1) - P(Y = 1|X = 0) = -0.1$$
$$P(Y = 1|X = 1, Z = 1) - P(Y = 1|X = 0, Z = 1) = 0.1$$
$$P(Y = 1|X = 1, Z = 0) - P(Y = 1|X = 0, Z = 0) = 0.1.$$

To summarize, we seem to have the following information:

Mathematical Statement                                  English Statement?
P(Y = 1|X = 1) < P(Y = 1|X = 0)                         treatment is harmful
P(Y = 1|X = 1, Z = 1) > P(Y = 1|X = 0, Z = 1)           treatment is beneficial to men
P(Y = 1|X = 1, Z = 0) > P(Y = 1|X = 0, Z = 0)           treatment is beneficial to women


Clearly,  something  is  amiss.  There  can’t  be  a  treatment  which  is  good  for 
men,  good  for  women,  but  bad  overall.  This  is  nonsense.  The  problem  is  with 
the  set  of  English  statements  in  the  table.  Our  translation  from  math  into 
English  is  specious. 


The inequality $P(Y = 1|X = 1) < P(Y = 1|X = 0)$ does not
mean that treatment is harmful.

The phrase “treatment is harmful” should be written mathematically as
$P(C_1 = 1) < P(C_0 = 1)$. The phrase “treatment is harmful for men” should
be written $P(C_1 = 1|Z = 1) < P(C_0 = 1|Z = 1)$. The three mathematical
statements in the table are not at all contradictory. It is only the translation
into English that is wrong.

Let us now show that a real Simpson's paradox cannot happen, that is,
there cannot be a treatment that is beneficial for men and women but harmful
overall. Suppose that treatment is beneficial for both sexes. Then

$$P(C_1 = 1|Z = z) > P(C_0 = 1|Z = z)$$

for all $z$. It then follows that

$$\begin{aligned}
P(C_1 = 1) &= \sum_z P(C_1 = 1|Z = z)\, P(Z = z) \\
&> \sum_z P(C_0 = 1|Z = z)\, P(Z = z) \\
&= P(C_0 = 1).
\end{aligned}$$

Hence, $P(C_1 = 1) > P(C_0 = 1)$, so treatment is beneficial overall. No paradox.
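The three differences in this section can be recomputed directly from the joint distribution of $(X, Y, Z)$. The sketch below uses the eight probabilities from the table; the helper names are illustrative. The last quantity, an average of the within-gender differences weighted by $P(Z = z)$, equals the causal effect only under the additional assumption that there is no unmeasured confounding given $Z$, which the section itself does not assert for these numbers.

```python
# Joint probabilities P(X = x, Y = y, Z = z), keyed by (x, y, z).
joint = {
    (1, 1, 1): 0.1500, (1, 0, 1): 0.2250, (1, 1, 0): 0.1000, (1, 0, 0): 0.0250,
    (0, 1, 1): 0.0375, (0, 0, 1): 0.0875, (0, 1, 0): 0.2625, (0, 0, 0): 0.1125,
}

def prob(predicate):
    """Total probability of the outcomes (x, y, z) satisfying the predicate."""
    return sum(p for (x, y, z), p in joint.items() if predicate(x, y, z))

def p_y1_given(x, z=None):
    """P(Y = 1 | X = x) if z is None, else P(Y = 1 | X = x, Z = z)."""
    num = prob(lambda xi, yi, zi: xi == x and yi == 1 and (z is None or zi == z))
    den = prob(lambda xi, yi, zi: xi == x and (z is None or zi == z))
    return num / den

marginal = p_y1_given(1) - p_y1_given(0)
men = p_y1_given(1, z=1) - p_y1_given(0, z=1)
women = p_y1_given(1, z=0) - p_y1_given(0, z=0)

# Weighted average of the conditional differences, with weights P(Z = z).
p_men = prob(lambda xi, yi, zi: zi == 1)
adjusted = men * p_men + women * (1 - p_men)

print(marginal, men, women, adjusted)
```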


16.5  Bibliographic  Remarks 

The use of potential outcomes to clarify causation is due mainly to Jerzy
Neyman and Donald Rubin. Later developments are due to Jamie Robins, Paul
Rosenbaum,  and  others.  A  parallel  development  took  place  in  econometrics 
by  various  people  including  James  Heckman  and  Charles  Manski.  Texts  on 
causation  include  Pearl  (2000),  Rosenbaum  (2002),  Spirtes  et  al.  (2000),  and 
van  der  Laan  and  Robins  (2003). 


16.6  Exercises 

1. Create an example like Example 16.2 in which $\alpha > 0$ and $\theta < 0$.

2. Prove Theorem 16.4.

3. Suppose you are given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from an observational
study, where $X_i \in \{0, 1\}$ and $Y_i \in \{0, 1\}$. Although it is not possible
to estimate the causal effect $\theta$, it is possible to put bounds on $\theta$. Find
upper and lower bounds on $\theta$ that can be consistently estimated from
the data. Show that the bounds have width 1.

Hint: Note that $E(C_1) = E(C_1|X = 1)P(X = 1) + E(C_1|X = 0)P(X = 0)$.

4. Suppose that $X \in \mathbb{R}$ and that, for each subject $i$, $C_i(x) = \beta_{1i} x$. Each
subject has their own slope $\beta_{1i}$. Construct a joint distribution on $(\beta_1, X)$
such that $P(\beta_1 > 0) = 1$ but $E(Y|X = x)$ is a decreasing function of $x$,
where $Y = C(X)$. Interpret.

5. Let $X \in \{0, 1\}$ be a binary treatment variable and let $(C_0, C_1)$ denote
the corresponding potential outcomes. Let $Y = C_X$ denote the observed
response. Let $F_0$ and $F_1$ be the cumulative distribution functions for
$C_0$ and $C_1$. Assume that $F_0$ and $F_1$ are both continuous and strictly
increasing. Let $\theta = m_1 - m_0$ where $m_0 = F_0^{-1}(1/2)$ is the median of $C_0$
and $m_1 = F_1^{-1}(1/2)$ is the median of $C_1$. Suppose that the treatment $X$
is assigned randomly. Find an expression for $\theta$ involving only the joint
distribution of $X$ and $Y$.


17 

Directed  Graphs  and  Conditional 
Independence 


17.1  Introduction 

A directed graph consists of a set of nodes with arrows between some nodes.
An example is shown in Figure 17.1.

Graphs are useful for representing independence relations between variables.
They can also be used as an alternative to counterfactuals to represent causal
relationships. Some people use the phrase Bayesian network to refer to a
directed graph endowed with a probability distribution. This is a poor choice
of terminology. Statistical inference for directed graphs can be performed using
frequentist or Bayesian methods, so it is misleading to call them Bayesian
networks.

FIGURE 17.1. A directed graph with vertices $V = \{X, Y, Z\}$ and edges
$E = \{(Y, X), (Y, Z)\}$.

Before getting into details about directed acyclic graphs (DAGs), we need
to discuss conditional independence.

17.2  Conditional  Independence 

17.1  Definition.  Let  X ,  Y  and  Z  be  random  variables.  X  and  Y  are 

conditionally  independent  given  Z ,  written  X  II  Y  \  Z 7  if 

fx,Y\z(x,y\z )  =  fx\z(x\z)fY\z(y\z)-  (17.1) 

for  all  x ,  y  and  z. 

Intuitively, this means that, once you know Z, Y provides no extra information
about X. An equivalent definition is that

$$f(x \mid y, z) = f(x \mid z). \tag{17.2}$$
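
For a discrete distribution, the definition can be checked numerically. The sketch below is only an illustration: the function name `cond_independent`, the dictionary layout of the joint distribution, and the example probabilities are all assumptions, not part of the text. It tests the multiplied-out form of (17.1), f(x,y,z) f(z) = f(x,z) f(y,z), which avoids dividing by f(z).

```python
from itertools import product

def cond_independent(joint, tol=1e-12):
    """Check X _||_ Y | Z for a discrete joint given as {(x, y, z): prob}.

    Uses f(x,y,z) f(z) = f(x,z) f(y,z), the multiplied-out form of (17.1).
    """
    xs = sorted({k[0] for k in joint})
    ys = sorted({k[1] for k in joint})
    zs = sorted({k[2] for k in joint})
    f = lambda x, y, z: joint.get((x, y, z), 0.0)
    for x, y, z in product(xs, ys, zs):
        fz = sum(f(a, b, z) for a in xs for b in ys)
        fxz = sum(f(x, b, z) for b in ys)
        fyz = sum(f(a, y, z) for a in xs)
        if abs(f(x, y, z) * fz - fxz * fyz) > tol:
            return False
    return True

# Hypothetical example: given Z, X and Y are independent coin flips
# whose common bias depends on Z.
bias = {0: 0.2, 1: 0.7}
joint = {}
for x, y, z in product((0, 1), repeat=3):
    px = bias[z] if x else 1 - bias[z]
    py = bias[z] if y else 1 - bias[z]
    joint[(x, y, z)] = 0.5 * px * py

# Marginal distribution of (X, Y), obtained by summing out Z.
marginal_xy = {(x, y): joint[(x, y, 0)] + joint[(x, y, 1)]
               for x, y in product((0, 1), repeat=2)}
```

In this constructed example X and Y are conditionally independent given Z but not marginally independent, since both depend on Z.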

The  conditional  independence  relation  satisfies  some  basic  properties. 

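17.2 Theorem. The following implications hold: 1

$$\begin{aligned}
X \amalg Y \mid Z &\implies Y \amalg X \mid Z \\
X \amalg Y \mid Z \text{ and } U = h(X) &\implies U \amalg Y \mid Z \\
X \amalg Y \mid Z \text{ and } U = h(X) &\implies X \amalg Y \mid (Z, U) \\
X \amalg Y \mid Z \text{ and } X \amalg W \mid (Y, Z) &\implies X \amalg (W, Y) \mid Z \\
X \amalg Y \mid Z \text{ and } X \amalg Z \mid Y &\implies X \amalg (Y, Z).
\end{aligned}$$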

17.3  DAGs 

A directed graph $\mathcal{G}$ consists of a set of vertices V and an edge set E of
ordered pairs of vertices. For our purposes, each vertex will correspond to a
random variable. If $(X, Y) \in E$ then there is an arrow pointing from X to Y.
See Figure 17.1.

1 The last property requires the assumption that all events have positive probability; the first
four do not.

FIGURE 17.2. DAG for Example 17.4.

If an arrow connects two variables X and Y (in either direction) we say
that X and Y are adjacent. If there is an arrow from X to Y then X is a
parent of Y and Y is a child of X. The set of all parents of X is denoted
by $\pi_X$ or $\pi(X)$. A directed path between two variables is a set of arrows
all pointing in the same direction linking one variable to the other such as:

X → ⋯ → Y

A sequence of adjacent vertices starting with X and ending with Y but
ignoring the direction of the arrows is called an undirected path. The sequence
{X, Y, Z} in Figure 17.1 is an undirected path. X is an ancestor of
Y if there is a directed path from X to Y (or X = Y). We also say that Y is
a descendant of X.

A configuration of the form:

X → Y ← Z

is called a collider at Y. A configuration not of that form is called a
non-collider, for example,

X → Y → Z

or

X ← Y → Z

The  collider  property  is  path  dependent.  In  Figure  17.7,  Y  is  a  collider  on 
the  path  {X,  Y,  Z}  but  it  is  a  non-collider  on  the  path  {X,  Y,  W}.  When  the 
variables  pointing  into  the  collider  are  not  adjacent,  we  say  that  the  collider 
is  unshielded.  A  directed  path  that  starts  and  ends  at  the  same  variable  is 
called  a  cycle.  A  directed  graph  is  acyclic  if  it  has  no  cycles.  In  this  case  we 
say  that  the  graph  is  a  directed  acyclic  graph  or  DAG.  From  now  on,  we 
only  deal  with  acyclic  graphs. 
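
The vocabulary of this section maps directly onto a few small functions. The following is a minimal sketch, assuming a graph is stored as a set of (from, to) pairs; the function names are illustrative choices, not notation from the text.

```python
def parents(edges, v):
    """Vertices with an arrow pointing into v."""
    return {a for a, b in edges if b == v}

def children(edges, v):
    """Vertices that v points to."""
    return {b for a, b in edges if a == v}

def ancestors(edges, v):
    """All X with a directed path from X to v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for p in parents(edges, stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def is_collider(edges, prev, mid, nxt):
    """True if the segment prev - mid - nxt has the form prev -> mid <- nxt."""
    return (prev, mid) in edges and (nxt, mid) in edges

def is_acyclic(edges):
    """A directed graph is acyclic if no directed path returns to its start."""
    vertices = {v for e in edges for v in e}
    for v in vertices:
        # v lies on a cycle iff v is an ancestor of one of its own parents
        if any(v in ancestors(edges, p) for p in parents(edges, v)):
            return False
    return True

# The graph of Figure 17.1: E = {(Y, X), (Y, Z)}.
fig_17_1 = {("Y", "X"), ("Y", "Z")}
```

On the graph of Figure 17.1, Y is the only parent of X, and Y is a non-collider on the undirected path {X, Y, Z} because both arrows point away from it.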


17.4  Probability  and  DAGs 


Let $\mathcal{G}$ be a DAG with vertices $V = (X_1, \ldots, X_k)$.

17.3 Definition. If P is a distribution for V with probability function f,
we say that P is Markov to $\mathcal{G}$, or that $\mathcal{G}$ represents P, if

$$f(v) = \prod_{i=1}^{k} f(x_i \mid \pi_i) \tag{17.3}$$

where $\pi_i$ are the parents of $X_i$. The set of distributions represented by $\mathcal{G}$
is denoted by $M(\mathcal{G})$.


17.4 Example. Figure 17.2 shows a DAG with four variables. The probability
function for this example factors as

f(overweight, smoking, heart disease, cough)
    = f(overweight) × f(smoking)
    × f(heart disease | overweight, smoking)
    × f(cough | smoking). ■

17.5 Example. For the DAG in Figure 17.3, $P \in M(\mathcal{G})$ if and only if its
probability function f has the form

$$f(x, y, z, w) = f(x) f(y) f(z \mid x, y) f(w \mid z).$$

FIGURE 17.3. Another DAG.

The following theorem says that $P \in M(\mathcal{G})$ if and only if the Markov
Condition holds. Roughly speaking, the Markov Condition means that every
variable W is independent of the “past” given its parents.

17.6 Theorem. A distribution $P \in M(\mathcal{G})$ if and only if the following Markov
Condition holds: for every variable W,

$$W \amalg \widetilde{W} \mid \pi_W \tag{17.4}$$

where $\widetilde{W}$ denotes all the other variables except the parents and descendants
of W.

17.7 Example. In Figure 17.3, the Markov Condition implies that

$$X \amalg Y \quad \text{and} \quad W \amalg \{X, Y\} \mid Z.$$ ■

17.8 Example. Consider the DAG in Figure 17.4. In this case the probability
function must factor like

$$f(a, b, c, d, e) = f(a) f(b \mid a) f(c \mid a) f(d \mid b, c) f(e \mid d).$$

The Markov Condition implies the following independence relations:

$$D \amalg A \mid \{B, C\}, \quad E \amalg \{A, B, C\} \mid D \quad \text{and} \quad B \amalg C \mid A$$ ■

17.5  More  Independence  Relations 

FIGURE 17.4. Yet another DAG.

The Markov Condition allows us to list some independence relations implied
by a DAG. These relations might imply other independence relations. Consider
the DAG in Figure 17.5. The Markov Condition implies:

$$X_1 \amalg X_2, \quad X_2 \amalg \{X_1, X_4\}, \quad X_3 \amalg X_4 \mid \{X_1, X_2\},$$

$$X_4 \amalg \{X_2, X_3\} \mid X_1, \quad X_5 \amalg \{X_1, X_2\} \mid \{X_3, X_4\}.$$

It turns out (but it is not obvious) that these conditions imply that

$$\{X_4, X_5\} \amalg X_2 \mid \{X_1, X_3\}.$$


How do we find these extra independence relations? The answer is
“d-separation” which means “directed separation.” d-separation can be summarized
by three rules. Consider the four DAGs in Figure 17.6 and the DAG in
Figure 17.7. The first three DAGs in Figure 17.6 have no colliders. The DAG in
the lower right of Figure 17.6 has a collider. The DAG in Figure 17.7 has a
collider with a descendant.

FIGURE 17.5. And yet another DAG.

FIGURE 17.6. The first three DAGs have no colliders. The fourth DAG in the lower
right corner has a collider at Y.

FIGURE 17.7. A collider with a descendant.

X → U ← V → W ← Y
    ↓       ↓
    S1      S2

FIGURE 17.8. d-separation explained.

The  Rules  of  d-Separation 
Consider  the  DAGs  in  Figures  17.6  and  17.7. 

1.  When  Y  is  not  a  collider,  X  and  Z  are  d-connected,  but  they  are 
d-separated  given  Y. 

2.  If  X  and  Z  collide  at  Y ,  then  X  and  Z  are  d-separated,  but  they 
are  d-connected  given  Y. 

3.  Conditioning  on  the  descendant  of  a  collider  has  the  same  effect  as 
conditioning  on  the  collider.  Thus  in  Figure  17.7,  X  and  Z  are 
d-separated  but  they  are  d-connected  given  W. 


Here is a more formal definition of d-separation. Let X and Y be distinct
vertices and let W be a set of vertices not containing X or Y. Then X and
Y are d-separated given W if there exists no undirected path U between
X and Y such that (i) every collider on U has a descendant in W, and (ii)
no other vertex on U is in W. If A, B, and W are distinct sets of vertices and
A and B are not empty, then A and B are d-separated given W if for every
$X \in A$ and $Y \in B$, X and Y are d-separated given W. Sets of vertices that
are not d-separated are said to be d-connected.

17.9  Example.  Consider  the  DAG  in  Figure  17.8.  From  the  d-separation  rules 
we  conclude  that: 

X and Y are d-separated (given the empty set);

X and Y are d-connected given {S1, S2};

X and Y are d-separated given {S1, S2, V}.
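
The formal definition can be turned into a brute-force checker that enumerates every undirected path and tests conditions (i) and (ii). This is a sketch for small graphs only, and the function names are assumptions. The edge list encodes Figure 17.8 as X → U ← V → W ← Y with U → S1 and W → S2, and a vertex counts as its own descendant, following the convention that X is an ancestor of itself.

```python
def descendants(children, v):
    """v together with every vertex reachable from v along arrows."""
    seen, stack = {v}, [v]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def d_separated(edges, x, y, given):
    """True if no undirected path between x and y is active given `given`."""
    given = set(given)
    edge_set = set(edges)
    children, neighbours = {}, {}
    for a, b in edge_set:
        children.setdefault(a, set()).add(b)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def active(path):
        for prev, mid, nxt in zip(path, path[1:], path[2:]):
            collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
            if collider:
                # (i) every collider needs a descendant in the conditioning set
                if not (descendants(children, mid) & given):
                    return False
            elif mid in given:
                # (ii) no non-collider may be in the conditioning set
                return False
        return True

    def paths(path):
        last = path[-1]
        if last == y:
            yield path
            return
        for n in neighbours.get(last, ()):
            if n not in path:
                yield from paths(path + [n])

    return not any(active(p) for p in paths([x]))

# Figure 17.8
fig_17_8 = [("X", "U"), ("V", "U"), ("V", "W"), ("Y", "W"),
            ("U", "S1"), ("W", "S2")]
# Figure 17.7: a collider at Y with descendant W
fig_17_7 = [("X", "Y"), ("Z", "Y"), ("Y", "W")]
```

Conditioning on S1 and S2 opens both colliders on the single path from X to Y; adding the non-collider V to the conditioning set blocks that path again.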

17.10 Theorem. 2 Let A, B, and C be disjoint sets of vertices. Then $A \amalg B \mid C$
if and only if A and B are d-separated by C.

2 We implicitly assume that P is faithful to $\mathcal{G}$, which means that P has no extra independence
relations other than those logically implied by the Markov Condition.

aliens → late ← watch

FIGURE 17.9. Jordan’s alien example (Example 17.11). Was your friend kidnapped
by aliens or did you forget to set your watch?

17.11  Example.  The  fact  that  conditioning  on  a  collider  creates  dependence 
might  not  seem  intuitive.  Here  is  a  whimsical  example  from  Jordan  (2004)  that 
makes  this  idea  more  palatable.  Your  friend  appears  to  be  late  for  a  meeting 
with  you.  There  are  two  explanations:  she  was  abducted  by  aliens  or  you  forgot 
to  set  your  watch  ahead  one  hour  for  daylight  savings  time.  (See  Figure  17.9.) 
Aliens  and  Watch  are  blocked  by  a  collider  which  implies  they  are  marginally 
independent.  This  seems  reasonable  since  —  before  we  know  anything  about 
your  friend  being  late  —  we  would  expect  these  variables  to  be  independent. 
We  would  also  expect  that  P(Aliens  =  yes|Late  =  yes)  >  P(Aliens  =  yes); 
learning  that  your  friend  is  late  certainly  increases  the  probability  that  she 
was  abducted.  But  when  we  learn  that  you  forgot  to  set  your  watch  properly, 
we would lower the chance that your friend was abducted. Hence, P(Aliens =
yes | Late = yes) ≠ P(Aliens = yes | Late = yes, Watch = no). Thus, Aliens and
Watch  are  dependent  given  Late.  ■ 
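
A small calculation makes the “explaining away” effect concrete. All numbers below are hypothetical, chosen only for illustration; they do not come from the text or from Jordan (2004). The variable `watch = 1` means the watch was not reset.

```python
from itertools import product

# Hypothetical probabilities (assumptions for illustration only).
P_ALIENS = 0.001   # prior probability of an abduction
P_WATCH = 0.05     # prior probability that the watch was not reset

def p_late(aliens, watch):
    """P(Late = yes | Aliens, Watch): either cause makes lateness very likely."""
    return 0.99 if (aliens or watch) else 0.02

# Joint distribution from the collider factorization f(a) f(w) f(l | a, w).
joint = {}
for a, w, l in product((0, 1), repeat=3):
    pa = P_ALIENS if a else 1 - P_ALIENS
    pw = P_WATCH if w else 1 - P_WATCH
    pl = p_late(a, w) if l else 1 - p_late(a, w)
    joint[(a, w, l)] = pa * pw * pl

def prob(event, given=lambda a, w, l: True):
    """P(event | given), both supplied as predicates on (a, w, l)."""
    num = sum(p for k, p in joint.items() if event(*k) and given(*k))
    den = sum(p for k, p in joint.items() if given(*k))
    return num / den

aliens = lambda a, w, l: a == 1
p_prior = prob(aliens)
p_given_late = prob(aliens, lambda a, w, l: l == 1)
p_given_late_and_watch = prob(aliens, lambda a, w, l: l == 1 and w == 1)
```

With these numbers, learning that the friend is late raises the probability of abduction well above its prior, and learning in addition that the watch was wrong brings it back down to the prior.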

17.12 Example. Consider the DAG in Figure 17.2. In this example, overweight
and smoking are marginally independent but they are dependent given
heart disease. ■

Graphs that look different may actually imply the same independence relations.
If $\mathcal{G}$ is a DAG, we let $\mathcal{I}(\mathcal{G})$ denote all the independence statements
implied by $\mathcal{G}$. Two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ for the same variables V are Markov
equivalent if $\mathcal{I}(\mathcal{G}_1) = \mathcal{I}(\mathcal{G}_2)$. Given a DAG $\mathcal{G}$, let skeleton($\mathcal{G}$) denote the
undirected graph obtained by replacing the arrows with undirected edges.

17.13 Theorem. Two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ are Markov equivalent if and only if
(i) skeleton($\mathcal{G}_1$) = skeleton($\mathcal{G}_2$) and (ii) $\mathcal{G}_1$ and $\mathcal{G}_2$ have the same unshielded
colliders.
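
The two conditions of the theorem are easy to test mechanically. The sketch below assumes each DAG is given as a collection of (from, to) pairs; the function names are illustrative. It is applied to the three-vertex chain, reversed chain, fork, and collider on X, Y, Z.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: each arrow becomes an unordered pair."""
    return {frozenset(e) for e in edges}

def unshielded_colliders(edges):
    """Triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for a, b in set(edges):
        parents.setdefault(b, set()).add(a)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(edges1, edges2):
    """Conditions (i) and (ii) of Theorem 17.13."""
    return (skeleton(edges1) == skeleton(edges2)
            and unshielded_colliders(edges1) == unshielded_colliders(edges2))

chain = {("X", "Y"), ("Y", "Z")}        # X -> Y -> Z
reverse = {("Z", "Y"), ("Y", "X")}      # X <- Y <- Z
fork = {("Y", "X"), ("Y", "Z")}         # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}     # X -> Y <- Z
```

All four graphs share the same skeleton, so any difference in equivalence class comes from the unshielded collider at Y.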

17.14 Example. The first three DAGs in Figure 17.6 are Markov equivalent.
The DAG in the lower right of the Figure is not Markov equivalent to the
others. ■

17.6 Estimation for DAGs

Two estimation questions arise in the context of DAGs. First, given a DAG
$\mathcal{G}$ and data $V_1, \ldots, V_n$ from a distribution f consistent with $\mathcal{G}$, how do we
estimate f? Second, given data $V_1, \ldots, V_n$, how do we estimate $\mathcal{G}$? The first
question is pure estimation while the second involves model selection. These
are very involved topics and are beyond the scope of this book. We will just
briefly mention the main ideas.

Typically, one uses some parametric model $f(x \mid \pi_x; \theta_x)$ for each conditional
density. The likelihood function is then

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} f(V_i; \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} f(X_{ij} \mid \pi_j; \theta_j),$$

where $X_{ij}$ is the value of $X_j$ for the $i$th data point and $\theta_j$ are the parameters for
the $j$th conditional density. We can then estimate the parameters by maximum
likelihood.
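
When every variable is discrete and each conditional distribution is left unrestricted, the likelihood above separates by variable, and the maximum likelihood estimate of each f(x_j | π_j) is a conditional relative frequency. The sketch below illustrates this on a hypothetical binary chain X → Y → Z; the probabilities, sample size, and function names are assumptions made for the example.

```python
import random
from collections import Counter

def sample_chain(n, rng):
    """Draw n samples from a hypothetical binary chain X -> Y -> Z."""
    data = []
    for _ in range(n):
        x = int(rng.random() < 0.3)                     # P(X = 1) = 0.3
        y = int(rng.random() < (0.8 if x else 0.2))     # P(Y = 1 | X)
        z = int(rng.random() < (0.9 if y else 0.1))     # P(Z = 1 | Y)
        data.append({"X": x, "Y": y, "Z": z})
    return data

def fit_cpts(data, parents):
    """MLE of each f(x_j | pi_j): counts of (parents, value) over counts of parents."""
    cpts = {}
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        cpts[var] = {key: c / marg[key[0]] for key, c in joint.items()}
    return cpts

parents = {"X": (), "Y": ("X",), "Z": ("Y",)}
data = sample_chain(20000, random.Random(0))
cpts = fit_cpts(data, parents)
# cpts["Y"][((1,), 1)] estimates P(Y = 1 | X = 1)
```

Each table is estimated from its own counts, which is why a known DAG structure makes parameter estimation comparatively easy.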

To  estimate  the  structure  of  the  DAG  itself,  we  could  fit  every  possible  DAG 
using  maximum  likelihood  and  use  AIC  (or  some  other  method)  to  choose  a 
DAG.  However,  there  are  many  possible  DAGs  so  you  would  need  much  data 
for  such  a  method  to  be  reliable.  Also,  searching  through  all  possible  DAGs 
is  a  serious  computational  challenge.  Producing  a  valid,  accurate  confidence 
set  for  the  DAG  structure  would  require  astronomical  sample  sizes.  If  prior 
information  is  available  about  part  of  the  DAG  structure,  the  computational 
and  statistical  problems  are  at  least  partly  ameliorated. 

17.7  Bibliographic  Remarks 

There  are  a  number  of  texts  on  DAGs  including  Edwards  (1995)  and  Jordan 
(2004).  The  first  use  of  DAGs  for  representing  causal  relationships  was  by 
Wright  (1934).  Modern  treatments  are  contained  in  Spirtes  et  al.  (2000)  and 
Pearl  (2000).  Robins  et  al.  (2003)  discuss  the  problems  with  estimating  causal 
structure  from  data. 

17.8  Appendix 

Causation Revisited. We discussed causation in Chapter 16 using the idea
of counterfactual random variables. A different approach to causation uses
DAGs. The two approaches are mathematically equivalent though they appear
to be quite different. In the DAG approach, the extra element is the idea of
intervention. Consider the DAG in Figure 17.10.

FIGURE 17.10. Conditioning versus intervening.

The probability function for a distribution consistent with this DAG has
the form f(x, y, z) = f(x) f(y|x) f(z|x, y). The following is pseudocode for
generating from this distribution.

    For i = 1, ..., n:
        x_i <- p_X(x_i)
        y_i <- p_{Y|X}(y_i | x_i)
        z_i <- p_{Z|X,Y}(z_i | x_i, y_i)

Suppose we repeat this code many times, yielding data $(x_1, y_1, z_1), \ldots, (x_n, y_n, z_n)$.
Among all the times that we observe Y = y, how often is Z = z? The answer
to this question is given by the conditional distribution of Z|Y. Specifically,

$$\begin{aligned}
\mathbb{P}(Z = z \mid Y = y) &= \frac{\mathbb{P}(Y = y, Z = z)}{\mathbb{P}(Y = y)} = \frac{f(y, z)}{f(y)} \\
&= \frac{\sum_x f(x, y, z)}{f(y)} = \frac{\sum_x f(x) f(y \mid x) f(z \mid x, y)}{f(y)} \\
&= \sum_x f(z \mid x, y) \frac{f(y \mid x) f(x)}{f(y)} = \sum_x f(z \mid x, y) \frac{f(x, y)}{f(y)} \\
&= \sum_x f(z \mid x, y) f(x \mid y).
\end{aligned}$$
Now suppose we intervene by changing the computer code. Specifically, suppose
we fix Y at the value y. The code now looks like this:

    set Y = y
    for i = 1, ..., n:
        x_i <- p_X(x_i)
        z_i <- p_{Z|X,Y}(z_i | x_i, y)

Having set Y = y, how often was Z = z? To answer, note that the intervention
has changed the joint probability to be

$$f^*(x, z) = f(x) f(z \mid x, y).$$

The answer to our question is given by the marginal distribution

$$f^*(z) = \sum_x f^*(x, z) = \sum_x f(x) f(z \mid x, y).$$

We shall denote this as P(Z = z | Y := y) or f(z | Y := y). We call P(Z =
z | Y = y) conditioning by observation or passive conditioning. We call
P(Z = z | Y := y) conditioning by intervention or active conditioning.
Passive  conditioning  is  used  to  answer  a  predictive  question  like: 

“Given  that  Joe  smokes,  what  is  the  probability  he  will  get  lung  cancer?” 
Active  conditioning  is  used  to  answer  a  causal  question  like: 

“If  Joe  quits  smoking,  what  is  the  probability  he  will  get  lung  cancer?” 

Consider a pair $(\mathcal{G}, P)$ where $\mathcal{G}$ is a DAG and P is a distribution for the
variables V of the DAG. Let f denote the probability function for P. Consider
intervening and fixing a variable X to be equal to x. We represent the
intervention by doing two things:

(1) Create a new DAG $\mathcal{G}^*$ by removing all arrows pointing into X;

(2) Create a new distribution $f^*(v) = \mathbb{P}(V = v \mid X := x)$ by removing the
term $f(x \mid \pi_X)$ from f(v).

The new pair $(\mathcal{G}^*, f^*)$ represents the intervention “set X = x.”
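
The difference between passive and active conditioning can be seen numerically for the three-variable DAG of Figure 17.10, where the factorization is f(x) f(y|x) f(z|x,y). The sketch below uses a hypothetical binary model; every probability, the sample size, and the function names are assumptions for illustration. It evaluates both formulas exactly and also runs the two versions of the generating code.

```python
import random

# Hypothetical binary model with factorization f(x) f(y|x) f(z|x,y).
def p_x(x):
    return 0.5                               # f(x): fair coin

def p_y_given_x(y, x):
    p1 = 0.9 if x else 0.1                   # P(Y = 1 | X = x)
    return p1 if y else 1 - p1

def p_z_given_xy(z, x, y):
    p1 = 0.2 + 0.6 * x + 0.1 * y             # P(Z = 1 | X = x, Y = y)
    return p1 if z else 1 - p1

def passive(z, y):
    """P(Z = z | Y = y) = sum_x f(z|x,y) f(x|y)."""
    fy = sum(p_x(x) * p_y_given_x(y, x) for x in (0, 1))
    return sum(p_z_given_xy(z, x, y) * p_x(x) * p_y_given_x(y, x) / fy
               for x in (0, 1))

def active(z, y):
    """P(Z = z | Y := y) = sum_x f(x) f(z|x,y)."""
    return sum(p_x(x) * p_z_given_xy(z, x, y) for x in (0, 1))

def simulate(n, rng, set_y=None):
    """Run the generating code; if set_y is given, Y is fixed rather than drawn."""
    draws = []
    for _ in range(n):
        x = int(rng.random() < p_x(1))
        y = set_y if set_y is not None else int(rng.random() < p_y_given_x(1, x))
        z = int(rng.random() < p_z_given_xy(1, x, y))
        draws.append((x, y, z))
    return draws

rng = random.Random(1)
observed = [z for x, y, z in simulate(50000, rng) if y == 1]
intervened = [z for x, y, z in simulate(50000, rng, set_y=1)]
passive_hat = sum(observed) / len(observed)
active_hat = sum(intervened) / len(intervened)
```

In this model observing Y = 1 makes X = 1 more likely, which in turn raises the probability of Z = 1; setting Y := 1 leaves the distribution of X untouched, so the two answers differ.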

17.15 Example. You may have noticed a correlation between rain and having
a wet lawn, that is, the variable “Rain” is not independent of the variable “Wet
Lawn” and hence $p_{R,W}(r, w) \neq p_R(r)\, p_W(w)$ where R denotes Rain and W
denotes Wet Lawn. Consider the following two DAGs:

Rain → Wet Lawn        Rain ← Wet Lawn.

The first DAG implies that f(w, r) = f(r) f(w|r) while the second implies
that f(w, r) = f(w) f(r|w). No matter what the joint distribution f(w, r) is,
both  graphs  are  correct.  Both  imply  that  R  and  W  are  not  independent.  But, 
intuitively,  if  we  want  a  graph  to  indicate  causation,  the  first  graph  is  right 
and  the  second  is  wrong.  Throwing  water  on  your  lawn  doesn’t  cause  rain. 
The  reason  we  feel  the  first  is  correct  while  the  second  is  wrong  is  because  the 
interventions  implied  by  the  first  graph  are  correct. 

Look at the first graph and form the intervention W = 1 where 1 denotes
“wet lawn.” Following the rules of intervention, we break the arrows into W
to get the modified graph:

Rain        set Wet Lawn = 1

with distribution f*(r) = f(r). Thus P(R = r | W := w) = P(R = r) tells us
that “wet lawn” does not cause rain.

Suppose  we  (wrongly)  assume  that  the  second  graph  is  the  correct  causal 
graph  and  form  the  intervention  W  =  1  on  the  second  graph.  There  are  no 
arrows  into  W  that  need  to  be  broken  so  the  intervention  graph  is  the  same 
as the original graph. Thus f*(r) = f(r|w) which would imply that changing
“wet”  changes  “rain.”  Clearly,  this  is  nonsense. 

Both  are  correct  probability  graphs  but  only  the  first  is  correct  causally. 
We  know  the  correct  causal  graph  by  using  background  knowledge. 


17.16  Remark.  We  could  try  to  learn  the  correct  causal  graph  from  data  but 
this  is  dangerous.  In  fact  it  is  impossible  with  two  variables.  With  more  than 
two  variables  there  are  methods  that  can  find  the  causal  graph  under  certain 
assumptions  but  they  are  large  sample  methods  and,  furthermore,  there  is  no 
way  to  ever  know  if  the  sample  size  you  have  is  large  enough  to  make  the 
methods  reliable. 


We can use DAGs to represent confounding variables. If X is a treatment
and Y is an outcome, a confounding variable Z is a variable with arrows into
both X and Y; see Figure 17.11. It is easy to check, using the formalism of
interventions, that the following facts are true:

In a randomized study, the arrow between Z and X is broken. In this case,
even with Z unobserved (represented by enclosing Z in a circle), the causal
relationship between X and Y is estimable because it can be shown that
$\mathbb{E}(Y \mid X := x) = \mathbb{E}(Y \mid X = x)$, which does not involve the unobserved Z. In
an observational study, with all confounders observed, we get
$\mathbb{E}(Y \mid X := x) = \int \mathbb{E}(Y \mid X = x, Z = z)\, dF_Z(z)$ as in formula (16.7). If Z is unobserved then we
cannot estimate the causal effect because
$\mathbb{E}(Y \mid X := x) = \int \mathbb{E}(Y \mid X = x, Z = z)\, dF_Z(z)$ involves the unobserved Z. We can’t just use X and Y since in this
case, $\mathbb{P}(Y = y \mid X = x) \neq \mathbb{P}(Y = y \mid X := x)$, which is just another way of saying
that causation is not association.

In fact, we can make a precise connection between DAGs and counterfactuals
as follows. Suppose that X and Y are binary. Define the confounding
variable Z by

$$Z = \begin{cases} 1 & \text{if } (C_0, C_1) = (0, 0) \\ 2 & \text{if } (C_0, C_1) = (0, 1) \\ 3 & \text{if } (C_0, C_1) = (1, 0) \\ 4 & \text{if } (C_0, C_1) = (1, 1). \end{cases}$$

FIGURE 17.11. Randomized study; Observational study with measured confounders;
Observational study with unmeasured confounders. The circled variables
are unobserved.
From  this,  you  can  make  the  correspondence  between  the  DAG  approach  and 
the  counterfactual  approach  explicit.  I  leave  this  for  the  interested  reader. 


17.9  Exercises 


1.  Show  that  (17.1)  and  (17.2)  are  equivalent. 

2.  Prove  Theorem  17.2. 


3. Let X, Y and Z have the following joint distribution:

              Z = 0                          Z = 1
          Y = 0   Y = 1                  Y = 0   Y = 1
  X = 0   .405    .045           X = 0   .125    .125
  X = 1   .045    .005           X = 1   .125    .125


(a)  Find  the  conditional  distribution  of  X  and  Y  given  Z  =  0  and  the 
conditional  distribution  of  X  and  Y  given  Z  =  1. 

(b) Show that $X \amalg Y \mid Z$.

(c)  Find  the  marginal  distribution  of  X  and  Y . 

(d)  Show  that  X  and  Y  are  not  marginally  independent. 


4. Consider the three DAGs in Figure 17.6 without a collider. Prove that
$X \amalg Z \mid Y$.

5. Consider the DAG in Figure 17.6 with a collider. Prove that $X \amalg Z$ and
that X and Z are dependent given Y.

FIGURE 17.12. DAG for exercise 7.

6. Let $X \in \{0, 1\}$, $Y \in \{0, 1\}$, $Z \in \{0, 1, 2\}$. Suppose the distribution of
(X, Y, Z) is Markov to:

X → Y → Z

Create a joint distribution f(x, y, z) that is Markov to this DAG. Generate
1000 random vectors from this distribution. Estimate the distribution
from the data using maximum likelihood. Compare the estimated
distribution to the true distribution. Let $\theta = (\theta_{000}, \theta_{001}, \ldots, \theta_{112})$ where
$\theta_{rst} = \mathbb{P}(X = r, Y = s, Z = t)$. Use the bootstrap to get standard errors
and 95 percent confidence intervals for these 12 parameters.

7.  Consider  the  DAG  in  Figure  17.12. 

(a)  Write  down  the  factorization  of  the  joint  density. 

(b) Prove that $X \amalg Z_j$.

8. Let V = (X, Y, Z) have the following joint distribution

X ∼ Bernoulli


(a) Find an expression for P(Z = z | Y = y). In particular, find P(Z =
1 | Y = 1).

(b)  Write  a  program  to  simulate  the  model.  Conduct  a  simulation  and 
compute  P(Z  =  1  |  Y  =  1)  empirically.  Plot  this  as  a  function  of 
the  simulation  size  N.  It  should  converge  to  the  theoretical  value  you 
computed  in  (a). 

(c) (Refers to material in the appendix.) Write down an expression for
P(Z = 1 | Y := y). In particular, find P(Z = 1 | Y := 1).

(d) (Refers to material in the appendix.) Modify your program to simulate
the intervention “set Y = 1.” Conduct a simulation and compute
P(Z = 1 | Y := 1) empirically. Plot this as a function of the simulation
size N. It should converge to the theoretical value you computed in (c).

9. This is a continuous, Gaussian version of the last question. Let V =
(X, Y, Z) have the following joint distribution

$$\begin{aligned}
X &\sim \text{Normal}(0, 1) \\
Y \mid X = x &\sim \text{Normal}(\alpha x, 1) \\
Z \mid X = x, Y = y &\sim \text{Normal}(\beta y + \gamma x, 1).
\end{aligned}$$

Here, α, β and γ are fixed parameters. Economists refer to models like
this as structural equation models.

(a) Find an explicit expression for $f(z \mid y)$ and $\mathbb{E}(Z \mid Y = y) = \int z f(z \mid y)\, dz$.

(b) (Refers to material in the appendix.) Find an explicit expression
for $f(z \mid Y := y)$ and then find $\mathbb{E}(Z \mid Y := y) = \int z f(z \mid Y := y)\, dz$.
Compare to (a).

(c) Find the joint distribution of (Y, Z). Find the correlation ρ between
Y and Z.

(d) (Refers to material in the appendix.) Suppose that X is not observed
and we try to make causal conclusions from the marginal distribution of
(Y, Z). (Think of X as unobserved confounding variables.) In particular,
suppose we declare that Y causes Z if ρ ≠ 0 and we declare that Y does
not cause Z if ρ = 0. Show that this will lead to erroneous conclusions.

(e) (Refers to material in the appendix.) Suppose we conduct a randomized
experiment in which Y is randomly assigned. To be concrete,
suppose that

$$\begin{aligned}
X &\sim \text{Normal}(0, 1) \\
Y &\sim \text{Normal}(\alpha, 1) \\
Z \mid X = x, Y = y &\sim \text{Normal}(\beta y + \gamma x, 1).
\end{aligned}$$

Show that the method in (d) now yields correct conclusions (i.e., ρ = 0
if and only if $f(z \mid Y := y)$ does not depend on y).


18 

Undirected  Graphs 


Undirected graphs are an alternative to directed graphs for representing independence
relations. Since both directed and undirected graphs are used in
practice,  it  is  a  good  idea  to  be  facile  with  both.  The  main  difference  between 
the  two  is  that  the  rules  for  reading  independence  relations  from  the  graph 
are  different. 


18.1  Undirected  Graphs 

An undirected graph $\mathcal{G} = (V, E)$ has a finite set V of vertices (or nodes)
and a set E of edges (or arcs) consisting of pairs of vertices. The vertices
correspond to random variables X, Y, Z, ... and edges are written as unordered
pairs. For example, $(X, Y) \in E$ means that X and Y are joined by an edge.
An example of a graph is in Figure 18.1.

Two vertices are adjacent, written $X \sim Y$, if there is an edge between
them. In Figure 18.1, X and Y are adjacent but X and Z are not adjacent. A
sequence $X_0, \ldots, X_n$ is called a path if $X_{i-1} \sim X_i$ for each i. In Figure 18.1,
X, Y, Z is a path. A graph is complete if there is an edge between every pair
of vertices. A subset $U \subset V$ of vertices together with their edges is called a
subgraph.




FIGURE 18.1. A graph with vertices V = {X, Y, Z}. The edge set is
E = {(X,Y),(Y,Z)}.

FIGURE 18.2. {Y, W} and {Z} are separated by {X}. Also, W and Z are separated
by {X, Y}.

If  A,  B  and  C  are  three  distinct  subsets  of  V,  we  say  that  C  separates 
A  and  B  if  every  path  from  a  variable  in  A  to  a  variable  in  B  intersects  a 
variable  in  C.  In  Figure  18.2  {Y,  W}  and  {Z}  are  separated  by  {X}.  Also,  W 
and  Z  are  separated  by  {X,  Y}. 
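
Separation is a reachability question: delete the vertices of C and ask whether any vertex of B can still be reached from A. The sketch below assumes the graph is given as a list of unordered pairs; the function name is illustrative. The edge set used for Figure 18.2 (the chain W, Y, X, Z) is an assumption chosen to be consistent with the separation statements above, not a transcription of the figure.

```python
def separates(edges, a, b, c):
    """True if every path from a vertex in `a` to a vertex in `b` meets `c`."""
    a, b, c = set(a), set(b), set(c)
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    # Search outward from A, never stepping onto a vertex of C.
    seen = a - c
    stack = list(seen)
    while stack:
        u = stack.pop()
        if u in b:
            return False
        for v in neighbours.get(u, ()):
            if v not in seen and v not in c:
                seen.add(v)
                stack.append(v)
    return True

# Figure 18.1: E = {(X, Y), (Y, Z)}.
fig_18_1 = [("X", "Y"), ("Y", "Z")]
# Assumed edge set for Figure 18.2: the chain W - Y - X - Z.
fig_18_2_assumed = [("W", "Y"), ("Y", "X"), ("X", "Z")]
```

With the empty set as C, the same function reports whether A and B are connected at all.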


18.2  Probability  and  Graphs 

Let  V  be  a  set  of  random  variables  with  distribution  P.  Construct  a  graph 
with  one  vertex  for  each  random  variable  in  V.  Omit  the  edge  between  a  pair 
of  variables  if  they  are  independent  given  the  rest  of  the  variables: 


$$\text{no edge between } X \text{ and } Y \iff X \amalg Y \mid \text{rest}$$

where “rest” refers to all the other variables besides X and Y. The resulting
graph is called a pairwise Markov graph. Some examples are shown in
Figures 18.3, 18.4, 18.5, and 18.6.

FIGURE 18.4. No implied independence relations.

The  graph  encodes  a  set  of  pairwise  conditional  independence  relations. 
These  relations  imply  other  conditional  independence  relations.  How  can  we 
figure  out  what  they  are?  Fortunately,  we  can  read  these  other  conditional 
independence  relations  directly  from  the  graph  as  well,  as  is  explained  in  the 
next  theorem. 


18.1 Theorem. Let $\mathcal{G} = (V, E)$ be a pairwise Markov graph for a distribution
P. Let A, B and C be distinct subsets of V such that C separates A and B.
Then $A \amalg B \mid C$.


18.2 Remark. If A and B are not connected (i.e., there is no path from A to
B) then we may regard A and B as being separated by the empty set. Then
Theorem 18.1 implies that $A \amalg B$.

FIGURE 18.5. $X \amalg Z \mid \{Y, W\}$ and $Y \amalg W \mid \{X, Z\}$.

FIGURE 18.6. Pairwise independence implies that $X \amalg Z \mid \{Y, W\}$. But is $X \amalg Z \mid Y$?

The independence condition in Theorem 18.1 is called the global Markov
property. We thus see that the pairwise and global Markov properties are
equivalent. Let us state this more precisely. Given a graph $\mathcal{G}$, let $M_{\text{pair}}(\mathcal{G})$
be the set of distributions which satisfy the pairwise Markov property: thus
$P \in M_{\text{pair}}(\mathcal{G})$ if, under P, $X \amalg Y \mid \text{rest}$ if and only if there is no edge between
X and Y. Let $M_{\text{global}}(\mathcal{G})$ be the set of distributions which satisfy the global
Markov property: thus $P \in M_{\text{global}}(\mathcal{G})$ if, under P, $A \amalg B \mid C$ if and only if C
separates A and B.

18.3 Theorem. Let $\mathcal{G}$ be a graph. Then, $M_{\text{pair}}(\mathcal{G}) = M_{\text{global}}(\mathcal{G})$.

Theorem 18.3 allows us to construct graphs using the simpler pairwise property
and then we can deduce other independence relations using the global
Markov property. Think how hard this would be to do algebraically. Returning
to Figure 18.6, we now see that $X \amalg Z \mid Y$ and $Y \amalg W \mid Z$.

18.4 Example. Figure 18.7 implies that $X \amalg Y$, $X \amalg Z$ and $X \amalg (Y, Z)$. ■


18.5 Example. Figure 18.8 implies that $X \amalg W \mid (Y, Z)$ and $X \amalg Z \mid Y$. ■

FIGURE 18.7. $X \amalg Y$, $X \amalg Z$ and $X \amalg (Y, Z)$.

FIGURE 18.8. $X \amalg W \mid (Y, Z)$ and $X \amalg Z \mid Y$.

18.3  Cliques  and  Potentials 

A  clique  is  a  set  of  variables  in  a  graph  that  are  all  adjacent  to  each  other.  A 
set  of  variables  is  a  maximal  clique  if  it  is  a  clique  and  if  it  is  not  possible 
to  include  another  variable  and  still  be  a  clique.  A  potential  is  any  positive 
function. Under certain conditions, it can be shown that P is Markov to $\mathcal{G}$ if and
only if its probability function f can be written as

$$f(x) = \frac{\prod_{C \in \mathcal{C}} \psi_C(x_C)}{Z} \tag{18.1}$$

where $\mathcal{C}$ is the set of maximal cliques and

$$Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C).$$
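
As a concrete instance of (18.1), take the graph of Figure 18.1, with edges (X, Y) and (Y, Z) and hence maximal cliques {X, Y} and {Y, Z}. The sketch below builds a distribution on binary variables from two hypothetical positive potentials (their values are arbitrary assumptions), computes the normalizing constant Z by summing over all configurations, and checks that X and Z are independent given Y, as the separation of X and Z by Y suggests.

```python
from itertools import product

# Hypothetical positive potentials on the cliques {X, Y} and {Y, Z}.
def psi1(x, y):
    return 2.0 if x == y else 1.0

def psi2(y, z):
    return 3.0 if y == z else 0.5

states = list(product((0, 1), repeat=3))
unnormalized = {(x, y, z): psi1(x, y) * psi2(y, z) for x, y, z in states}
Z_const = sum(unnormalized.values())            # normalizing constant Z
f = {s: v / Z_const for s, v in unnormalized.items()}

def x_indep_z_given_y(f, tol=1e-12):
    """Check f(x,y,z) f(y) = f(x,y) f(y,z) for every configuration."""
    for x, y, z in f:
        fy = sum(f[(a, y, c)] for a in (0, 1) for c in (0, 1))
        fxy = sum(f[(x, y, c)] for c in (0, 1))
        fyz = sum(f[(a, y, z)] for a in (0, 1))
        if abs(f[(x, y, z)] * fy - fxy * fyz) > tol:
            return False
    return True
```

The potentials need not be probabilities themselves; only their normalized product is.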

18.6 Example. The maximal cliques for the graph in Figure 18.1 are
$C_1 = \{X, Y\}$ and $C_2 = \{Y, Z\}$. Hence, if P is Markov to the graph, then its
probability function can be written

$$f(x, y, z) \propto \psi_1(x, y)\, \psi_2(y, z)$$

for some positive functions $\psi_1$ and $\psi_2$. ■

FIGURE 18.9. The maximal cliques of this graph are
$\{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_4\}, \{X_3, X_5\}, \{X_2, X_5, X_6\}$.

18.7 Example. The maximal cliques for the graph in Figure 18.9 are

$$\{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_4\}, \{X_3, X_5\}, \{X_2, X_5, X_6\}.$$

Thus we can write the probability function as

$$f(x_1, x_2, x_3, x_4, x_5, x_6) \propto \psi_{12}(x_1, x_2)\, \psi_{13}(x_1, x_3)\, \psi_{24}(x_2, x_4)\, \psi_{35}(x_3, x_5)\, \psi_{256}(x_2, x_5, x_6).$$ ■
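
For a graph this small, the maximal cliques can be found by brute force. The sketch below is one simple way to do it and is exponential in the number of vertices; the function name is an assumption. The edge list is derived from the cliques listed in Example 18.7, since two vertices are adjacent exactly when they share a maximal clique.

```python
from itertools import combinations

def maximal_cliques(vertices, edges):
    """Brute-force search: list all cliques, keep those not inside a larger one."""
    adjacent = {frozenset(e) for e in edges}

    def is_clique(s):
        return all(frozenset(p) in adjacent for p in combinations(s, 2))

    cliques = [set(s) for r in range(1, len(vertices) + 1)
               for s in combinations(vertices, r) if is_clique(s)]
    return [c for c in cliques if not any(c < other for other in cliques)]

vertices = ["X1", "X2", "X3", "X4", "X5", "X6"]
edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X4"), ("X3", "X5"),
         ("X2", "X5"), ("X2", "X6"), ("X5", "X6")]
found = sorted(sorted(c) for c in maximal_cliques(vertices, edges))
```

Each maximal clique found this way contributes one potential to the factorization.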


18.4  Fitting  Graphs  to  Data 

Given  a  data  set,  how  do  we  find  a  graphical  model  that  fits  the  data?  As 
with  directed  graphs,  this  is  a  big  topic  that  we  will  not  treat  here.  However, 
in  the  discrete  case,  one  way  to  fit  a  graph  to  data  is  to  use  a  log-linear 
model,  which  is  the  subject  of  the  next  chapter. 


18.5  Bibliographic  Remarks 

Thorough  treatments  of  undirected  graphs  can  be  found  in  Whittaker  (1990) 
and  Lauritzen  (1996).  Some  of  the  exercises  below  are  from  Whittaker  (1990). 


18.6  Exercises 

1. Consider random variables $(X_1, X_2, X_3)$. In each of the following cases,
draw a graph that has the given independence relations.

(a) $X_1 \amalg X_3 \mid X_2$.

(b) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$.

(c) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$ and $X_2 \amalg X_3 \mid X_1$.

2. Consider random variables $(X_1, X_2, X_3, X_4)$. In each of the following
cases, draw a graph that has the given independence relations.

(a) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_1 \amalg X_4 \mid X_2, X_3$ and $X_2 \amalg X_4 \mid X_1, X_3$.

(b) $X_1 \amalg X_2 \mid X_3, X_4$ and $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_3 \mid X_1, X_4$.

(c) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_4 \mid X_1, X_3$.

FIGURE 18.10.

FIGURE 18.11.

3.  A  conditional  independence  between  a  pair  of  variables  is  minimal  if  it 
is  not  possible  to  use  the  Separation  Theorem  to  eliminate  any  variable 
from  the  conditioning  set,  i.e.  from  the  right  hand  side  of  the  bar  Whit¬ 
taker  (1990).  Write  down  the  minimal  conditional  independencies  from: 
(a)  Figure  18.10;  (b)  Figure  18.11;  (c)  Figure  18.12;  (d)  Figure  18.13. 

4. Let $X_1, X_2, X_3$ be binary random variables. Construct the likelihood
ratio test for

$$H_0 : X_1 \amalg X_2 \mid X_3 \quad \text{versus} \quad H_1 : X_1 \text{ is not independent of } X_2 \mid X_3.$$

FIGURE 18.12.

FIGURE 18.13.

5. Here are breast cancer data from Morrison et al. (1973) on diagnostic
center ($X_1$), nuclear grade ($X_2$), and survival ($X_3$):

                 $X_2$   malignant   malignant   benign   benign
                 $X_3$   died        survived    died     survived
$X_1$  Boston            35          59          47       112
       Glamorgan         42          77          26       76

(a) Treat this as a multinomial and find the maximum likelihood estimator.

(b) If someone has a tumor classified as benign at the Glamorgan clinic,
what is the estimated probability that they will die? Find the standard
error for this estimate.

(c) Test the following hypotheses:

$X_1 \amalg X_2 \mid X_3$ versus $X_1 \not\amalg X_2 \mid X_3$
$X_1 \amalg X_3 \mid X_2$ versus $X_1 \not\amalg X_3 \mid X_2$
$X_2 \amalg X_3 \mid X_1$ versus $X_2 \not\amalg X_3 \mid X_1$

Use  the  test  from  question  4.  Based  on  the  results  of  your  tests,  draw 
and  interpret  the  resulting  graph. 


19 

Log-Linear  Models 


In this chapter we study log-linear models, which are useful for modeling
multivariate discrete data. There is a strong connection between log-linear
models and undirected graphs.


19.1  The  Log-Linear  Model 

Let $X = (X_1, \ldots, X_m)$ be a discrete random vector with probability function

$$f(x) = \mathbb{P}(X = x) = \mathbb{P}(X_1 = x_1, \ldots, X_m = x_m)$$

where $x = (x_1, \ldots, x_m)$. Let $r_j$ be the number of values that $X_j$ takes. Without
loss of generality, we can assume that $X_j \in \{0, 1, \ldots, r_j - 1\}$. Suppose now
that we have $n$ such random vectors. We can think of the data as a sample
from a Multinomial with $N = r_1 \times r_2 \times \cdots \times r_m$ categories. The data can be
represented as counts in an $r_1 \times r_2 \times \cdots \times r_m$ table. Let $p = (p_1, \ldots, p_N)$ denote
the multinomial parameter.

Let $S = \{1, \ldots, m\}$. Given a vector $x = (x_1, \ldots, x_m)$ and a subset $A \subset S$,
let $x_A = (x_j : j \in A)$. For example, if $A = \{1, 3\}$ then $x_A = (x_1, x_3)$.

19.1 Theorem. The joint probability function $f(x)$ of a single random vector
$X = (X_1, \ldots, X_m)$ can be written as

$$\log f(x) = \sum_{A \subset S} \psi_A(x) \qquad (19.1)$$

where the sum is over all subsets $A$ of $S = \{1, \ldots, m\}$ and the $\psi$'s satisfy the
following conditions:

1. $\psi_\emptyset(x)$ is a constant;

2. For every $A \subset S$, $\psi_A(x)$ is only a function of $x_A$ and not the rest of the
$x_j$'s.

3. If $i \in A$ and $x_i = 0$, then $\psi_A(x) = 0$.

The formula in equation (19.1) is called the log-linear expansion of $f$.
Each $\psi_A(x)$ may depend on some unknown parameters $\beta_A$. Let
$\beta = (\beta_A : A \subset S)$ be the set of all these parameters. We will write
$f(x) = f(x; \beta)$ when we want to emphasize the dependence on the unknown
parameters $\beta$.

In terms of the multinomial, the parameter space is

$$\mathcal{P} = \Bigl\{ p = (p_1, \ldots, p_N) : p_j \ge 0,\ \sum_{j=1}^N p_j = 1 \Bigr\}.$$

This is an $N - 1$ dimensional space. In the log-linear representation, the
parameter space is

$$\Theta = \Bigl\{ \beta = (\beta_1, \ldots, \beta_N) : \beta = \beta(p),\ p \in \mathcal{P} \Bigr\}$$

where $\beta(p)$ is the set of $\beta$ values associated with $p$. The set $\Theta$ is an $N - 1$
dimensional surface in $\mathbb{R}^N$. We can always go back and forth between the two
parameterizations: we can write $\beta = \beta(p)$ and $p = p(\beta)$.

19.2 Example. Let $X \sim \text{Bernoulli}(p)$ where $0 < p < 1$. We can write the
probability mass function for $X$ as

$$f(x) = p^x (1 - p)^{1 - x} = p_1^x\, p_2^{1 - x}$$

for $x = 0, 1$, where $p_1 = p$ and $p_2 = 1 - p$. Hence,

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x)$$

where

$$\psi_\emptyset(x) = \log(p_2), \qquad \psi_1(x) = x \log\left(\frac{p_1}{p_2}\right).$$

Notice that $\psi_\emptyset(x)$ is a constant (as a function of $x$) and $\psi_1(x) = 0$ when $x = 0$.
Thus the three conditions of Theorem 19.1 hold. The log-linear parameters
are

$$\beta_0 = \log(p_2), \qquad \beta_1 = \log\left(\frac{p_1}{p_2}\right).$$

The original, multinomial parameter space is $\mathcal{P} = \{(p_1, p_2) : p_j \ge 0,\ p_1 + p_2 = 1\}$.
The log-linear parameter space is

$$\Theta = \Bigl\{ (\beta_0, \beta_1) \in \mathbb{R}^2 : e^{\beta_0 + \beta_1} + e^{\beta_0} = 1 \Bigr\}.$$

Given $(p_1, p_2)$ we can solve for $(\beta_0, \beta_1)$. Conversely, given $(\beta_0, \beta_1)$ we can solve
for $(p_1, p_2)$. ■


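19.3 Example. Let $X = (X_1, X_2)$ where $X_1 \in \{0, 1\}$ and $X_2 \in \{0, 1, 2\}$. The
joint distribution of $n$ such random vectors is a multinomial with 6 categories.
The multinomial parameters can be written as a 2-by-3 table as follows:

multinomial    $x_2$:   0          1          2
$x_1$   0               $p_{00}$   $p_{01}$   $p_{02}$
        1               $p_{10}$   $p_{11}$   $p_{12}$

The $n$ data vectors can be summarized as counts:

data           $x_2$:   0          1          2
$x_1$   0               $C_{00}$   $C_{01}$   $C_{02}$
        1               $C_{10}$   $C_{11}$   $C_{12}$

For $x = (x_1, x_2)$, the log-linear expansion takes the form

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_{12}(x)$$

where

$$\begin{aligned}
\psi_\emptyset(x) &= \log p_{00} \\
\psi_1(x) &= x_1 \log\left(\frac{p_{10}}{p_{00}}\right) \\
\psi_2(x) &= I(x_2 = 1) \log\left(\frac{p_{01}}{p_{00}}\right) + I(x_2 = 2) \log\left(\frac{p_{02}}{p_{00}}\right) \\
\psi_{12}(x) &= I(x_1 = 1, x_2 = 1) \log\left(\frac{p_{11} p_{00}}{p_{01} p_{10}}\right) + I(x_1 = 1, x_2 = 2) \log\left(\frac{p_{12} p_{00}}{p_{02} p_{10}}\right).
\end{aligned}$$

Convince yourself that the three conditions on the $\psi$'s of the theorem are
satisfied. The six parameters of this model are:

$$\beta_1 = \log p_{00}, \quad \beta_2 = \log\left(\frac{p_{10}}{p_{00}}\right), \quad \beta_3 = \log\left(\frac{p_{01}}{p_{00}}\right),$$

$$\beta_4 = \log\left(\frac{p_{02}}{p_{00}}\right), \quad \beta_5 = \log\left(\frac{p_{11} p_{00}}{p_{01} p_{10}}\right), \quad \beta_6 = \log\left(\frac{p_{12} p_{00}}{p_{02} p_{10}}\right).$$ ■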
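The terms of Example 19.3 can be recovered mechanically from a table of probabilities. The sketch below assumes the inclusion-exclusion formula $\psi_A(x) = \sum_{B \subset A} (-1)^{|A| - |B|} \log f(x_B, 0)$, where coordinates outside $B$ are set to 0. This formula is not stated in the text, but it satisfies the three conditions of Theorem 19.1. The names `psi`, `log_f`, `expansion` and the numerical table are illustrative assumptions, and coordinates are indexed from 0 in the code.

```python
from itertools import combinations
from math import log, isclose

# Hypothetical 2-by-3 table of probabilities p[x1][x2] (illustrative values only).
p = [[0.10, 0.20, 0.15],
     [0.25, 0.05, 0.25]]

def log_f(x):
    """Log of the joint probability function at x = (x1, x2)."""
    return log(p[x[0]][x[1]])

def psi(A, x):
    """psi_A(x) by inclusion-exclusion; coordinates outside B are set to 0."""
    total = 0.0
    for r in range(len(A) + 1):
        for B in combinations(A, r):
            x_B = tuple(x[j] if j in B else 0 for j in range(len(x)))
            total += (-1) ** (len(A) - r) * log_f(x_B)
    return total

def expansion(x):
    """Sum of psi_A(x) over all subsets A of the coordinate indices."""
    idx = range(len(x))
    return sum(psi(A, x) for r in range(len(x) + 1) for A in combinations(idx, r))

# The expansion reproduces log f(x) at every cell of the table.
for x1 in range(2):
    for x2 in range(3):
        assert isclose(expansion((x1, x2)), log_f((x1, x2)))
```

Here `psi((0, 1), (1, 2))` plays the role of $\psi_{12}(x)$ at $x = (1, 2)$, which the example gives as $\log(p_{12} p_{00} / (p_{02} p_{10}))$.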

The  next  theorem  gives  an  easy  way  to  check  for  conditional  independence 
in  a  log-linear  model. 

19.4 Theorem. Let $(X_a, X_b, X_c)$ be a partition of a vector $(X_1, \ldots, X_m)$.
Then $X_b \amalg X_c \mid X_a$ if and only if all the $\psi$-terms in the log-linear expansion
that have at least one coordinate in $b$ and one coordinate in $c$ are 0.

To prove this theorem, we will use the following lemma whose proof follows
easily from the definition of conditional independence.

19.5 Lemma. A partition $(X_a, X_b, X_c)$ satisfies $X_b \amalg X_c \mid X_a$ if and only if
$f(x_a, x_b, x_c) = g(x_a, x_b)\, h(x_a, x_c)$ for some functions $g$ and $h$.

Proof. (Theorem 19.4.) Suppose that $\psi_t$ is 0 whenever $t$ has coordinates
in $b$ and $c$. Hence, $\psi_t$ can be nonzero only if $t \subset a \cup b$ or $t \subset a \cup c$. Therefore

$$\log f(x) = \sum_{t \subset a \cup b} \psi_t(x) + \sum_{t \subset a \cup c} \psi_t(x) - \sum_{t \subset a} \psi_t(x).$$

Exponentiating, we see that the joint density is of the form $g(x_a, x_b)\, h(x_a, x_c)$.
By Lemma 19.5, $X_b \amalg X_c \mid X_a$. The converse follows by reversing the argument. ■


19.2  Graphical  Log-Linear  Models 

A  log-linear  model  is  graphical  if  missing  terms  correspond  only  to  condi¬ 
tional  independence  constraints. 

19.6 Definition. Let $\log f(x) = \sum_{A \subset S} \psi_A(x)$ be a log-linear model. Then
$f$ is graphical if all $\psi$-terms are nonzero except for any pair of
coordinates not in the edge set for some graph $\mathcal{G}$. In other words,
$\psi_A(x) = 0$ if and only if $\{i, j\} \subset A$ and $(i, j)$ is not an edge.

Here is a way to think about the definition above: If you can add a term to
the model and the graph does not change, then the model is not graphical.

FIGURE 19.1. Graph for Example 19.7.

19.7 Example. Consider the graph in Figure 19.1.

The graphical log-linear model that corresponds to this graph is

$$\begin{aligned}
\log f(x) = {} & \psi_\emptyset + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_4(x) + \psi_5(x) \\
& + \psi_{12}(x) + \psi_{23}(x) + \psi_{25}(x) + \psi_{34}(x) + \psi_{35}(x) + \psi_{45}(x) + \psi_{235}(x) + \psi_{345}(x).
\end{aligned}$$

Let's see why this model is graphical. The edge (1, 5) is missing in the graph.
Hence any term containing that pair of indices is omitted from the model. For
example,

$$\psi_{15},\ \psi_{125},\ \psi_{135},\ \psi_{145},\ \psi_{1235},\ \psi_{1245},\ \psi_{1345},\ \psi_{12345}$$

are all omitted. Similarly, the edge (2, 4) is missing and hence

$$\psi_{24},\ \psi_{124},\ \psi_{234},\ \psi_{245},\ \psi_{1234},\ \psi_{1245},\ \psi_{2345},\ \psi_{12345}$$

are all omitted. There are other missing edges as well. You can check that the
model omits all the corresponding $\psi$ terms. Now consider the model

$$\begin{aligned}
\log f(x) = {} & \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_4(x) + \psi_5(x) \\
& + \psi_{12}(x) + \psi_{23}(x) + \psi_{25}(x) + \psi_{34}(x) + \psi_{35}(x) + \psi_{45}(x).
\end{aligned}$$

This is the same model except that the three-way interactions were removed.
If we draw a graph for this model, we will get the same graph. For example,
no $\psi$ terms contain (1, 5) so we omit the edge between $X_1$ and $X_5$. But this is
not graphical since it has extra terms omitted. The independencies and graphs
for the two models are the same but the latter model has other constraints
besides conditional independence constraints. This is not a bad thing. It just
means that if we are only concerned about presence or absence of conditional
independencies, then we need not consider such a model. The presence of the
three-way interaction $\psi_{235}$ means that the strength of association between $X_2$
and $X_3$ varies as a function of $X_5$. Its absence indicates that this is not so. ■

FIGURE 19.2. Graph for Example 19.10.


19.3  Hierarchical  Log-Linear  Models 

There is a set of log-linear models that is larger than the set of graphical
models and that is used quite a bit. These are the hierarchical log-linear
models.

19.8 Definition. A log-linear model is hierarchical if $\psi_A = 0$ and $A \subset B$
implies that $\psi_B = 0$.


19.9  Lemma.  A  graphical  model  is  hierarchical  but  the  reverse  need  not  be 
true. 

19.10 Example. Let

$$\log f(x) = \psi_\emptyset + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_{12}(x) + \psi_{13}(x).$$

The model is hierarchical; its graph is given in Figure 19.2. The model is
also graphical because all terms involving (2, 3) are omitted.


19.11 Example. Let

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_{12}(x) + \psi_{13}(x) + \psi_{23}(x).$$

The model is hierarchical. It is not graphical. The graph corresponding to this
model is complete; see Figure 19.3. It is not graphical because $\psi_{123}(x) = 0$
which does not correspond to any pairwise conditional independence. ■

FIGURE 19.3. The graph is complete. The model is hierarchical but not graphical.

FIGURE 19.4. The model for this graph is not hierarchical.

19.12 Example. Let

$$\log f(x) = \psi_\emptyset(x) + \psi_3(x) + \psi_{12}(x).$$

The corresponding graph is in Figure 19.4. This model is not hierarchical since
$\psi_2 = 0$ but $\psi_{12}$ is not. Since it is not hierarchical, it is not graphical either. ■
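Definitions 19.6 and 19.8 can be checked mechanically when a model is described by the index sets of its nonzero $\psi$-terms. The following is a minimal sketch under that assumption; the function names and the encoding of a model as a list of tuples are illustrative choices, not notation from the text.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items`, as frozensets (including the empty set)."""
    items = sorted(items)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

def is_hierarchical(terms):
    """Every subset of an included term must itself be included."""
    terms = {frozenset(t) for t in terms}
    return all(subsets(t) <= terms for t in terms)

def edge_set(terms):
    """Pairs (i, j) that appear together in at least one included term."""
    return {frozenset(pair) for t in terms for pair in combinations(sorted(t), 2)}

def is_graphical(terms, variables):
    """Included terms must be exactly the subsets whose pairs are all edges."""
    terms = {frozenset(t) for t in terms}
    edges = edge_set(terms)
    complete = {A for A in subsets(variables)
                if all(frozenset(pair) in edges for pair in combinations(sorted(A), 2))}
    return terms == complete

# Index sets of the nonzero psi-terms; () stands for the constant term.
ex_19_10 = [(), (1,), (2,), (3,), (1, 2), (1, 3)]
ex_19_11 = ex_19_10 + [(2, 3)]
ex_19_12 = [(), (3,), (1, 2)]

for name, terms in [("19.10", ex_19_10), ("19.11", ex_19_11), ("19.12", ex_19_12)]:
    print(name, is_hierarchical(terms), is_graphical(terms, {1, 2, 3}))
```

The three calls reproduce the conclusions of Examples 19.10 to 19.12: hierarchical and graphical, hierarchical but not graphical, and neither.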


19.4  Model  Generators 

Hierarchical models can be written succinctly using generators. This is most
easily explained by example. Suppose that $X = (X_1, X_2, X_3)$. Then, $M = 1.2 + 1.3$
stands for

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13}.$$

The formula $M = 1.2 + 1.3$ says: “include $\psi_{12}$ and $\psi_{13}$.” We have to also include
the lower order terms or it won’t be hierarchical. The generator $M = 1.2.3$ is
the saturated model

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13} + \psi_{23} + \psi_{123}.$$

The saturated model corresponds to fitting an unconstrained multinomial.
Consider $M = 1 + 2 + 3$ which means

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3.$$

This is the mutual independence model. Finally, consider $M = 1.2$ which has
log-linear expansion

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_{12}.$$

This model makes $X_3 \mid X_2 = x_2, X_1 = x_1$ a uniform distribution.
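The generator notation can be expanded by adding, for each generator term, the term itself and all of its subsets. The sketch below is one way to do this; the string format it parses (indices separated by `.` and terms by `+`) mirrors the notation above, while the function names `expand_generator` and `show` are hypothetical.

```python
from itertools import combinations

def expand_generator(generator):
    """Expand a generator such as '1.2 + 1.3' into the index sets of its psi-terms.

    Each generator term contributes itself and all of its subsets, so the
    resulting model is hierarchical by construction.
    """
    terms = {frozenset()}  # the constant term is always present
    for piece in generator.split("+"):
        indices = sorted(int(s) for s in piece.strip().split("."))
        for r in range(1, len(indices) + 1):
            terms.update(frozenset(c) for c in combinations(indices, r))
    return terms

def show(terms):
    """Readable labels: '0' for the constant term, '12' for psi_12, and so on."""
    ordered = sorted(terms, key=lambda t: (len(t), sorted(t)))
    return ["".join(map(str, sorted(t))) or "0" for t in ordered]

print(show(expand_generator("1.2 + 1.3")))
print(show(expand_generator("1.2.3")))
```

The first call lists the constant, the three main effects, and the two interactions $\psi_{12}$ and $\psi_{13}$; the second lists all eight terms of the saturated model.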


19.5  Fitting  Log-Linear  Models  to  Data 

Let $\beta$ denote all the parameters in a log-linear model $M$. The log-likelihood
for $\beta$ is

$$\ell(\beta) = \sum_{i=1}^n \log f(X_i; \beta)$$

where $f(X_i; \beta)$ is the probability function for the $i$th random vector
$X_i = (X_{i1}, \ldots, X_{im})$ as given by equation (19.1). The mle $\hat\beta$ generally has to be
found numerically. The Fisher information matrix is also found numerically
and we can then get the estimated standard errors from the inverse Fisher
information matrix.

When fitting log-linear models, one has to address the following model
selection problem: which $\psi$ terms should we include in the model? This is
essentially the same as the model selection problem in linear regression.

One approach is to use AIC. Let $M$ denote some log-linear model. Different
models correspond to setting different $\psi$ terms to 0. Now we choose the
model $M$ which maximizes

$$\text{AIC}(M) = \hat\ell(M) - |M| \qquad (19.2)$$

where $|M|$ is the number of parameters in model $M$ and $\hat\ell(M)$ is the value
of the log-likelihood evaluated at the mle for that model. Usually the model
search is restricted to hierarchical models. This reduces the search space. Some
also claim that we should only search through the hierarchical models because
other models are less interpretable.

A different approach is based on hypothesis testing. The model that includes
all possible $\psi$-terms is called the saturated model and we denote it by $M_{\text{sat}}$.
Now for each $M$ we test the hypothesis

$$H_0 : \text{the true model is } M \quad \text{versus} \quad H_1 : \text{the true model is } M_{\text{sat}}.$$

The likelihood ratio test for this hypothesis is called the deviance.

19.13 Definition. For any submodel $M$, define the deviance $\text{dev}(M)$ by

$$\text{dev}(M) = 2(\hat\ell_{\text{sat}} - \hat\ell_M)$$

where $\hat\ell_{\text{sat}}$ is the log-likelihood of the saturated model evaluated at the mle
and $\hat\ell_M$ is the log-likelihood of the model $M$ evaluated at its mle.


19.14 Theorem. The deviance is the likelihood ratio test statistic for

$$H_0 : \text{the model is } M \quad \text{versus} \quad H_1 : \text{the model is } M_{\text{sat}}.$$

Under $H_0$, $\text{dev}(M) \rightsquigarrow \chi^2_\nu$ with $\nu$ degrees of freedom equal to the difference in
the number of parameters between the saturated model and $M$.

One way to find a good model is to use the deviance to test every sub-model.
Every model that is not rejected by this test is then considered a plausible
model. However, this is not a good strategy for two reasons. First, we will end
up doing many tests, which means that there is ample opportunity for making
Type I and Type II errors. Second, we will end up using models where we
failed to reject $H_0$. But we might fail to reject $H_0$ due to low power. The
result is that we end up with a bad model just due to low power.

After  finding  a  “best  model”  this  way  we  can  draw  the  corresponding  graph. 

19.15 Example. The following breast cancer data are from Morrison et al.
(1973). The data are on diagnostic center ($X_1$), nuclear grade ($X_2$), and
survival ($X_3$):

                 $X_2$   malignant   malignant   benign   benign
                 $X_3$   died        survived    died     survived
$X_1$  Boston            35          59          47       112
       Glamorgan         42          77          26       76

The saturated log-linear model is:

Variable                      $\hat\beta_j$   se      $W_j$    p-value
(Intercept)                   3.56    0.17    21.03    0.00 ***
center                        0.18    0.22    0.79     0.42
grade                         0.29    0.22    1.32     0.18
survival                      0.52    0.21    2.44     0.01 *
center × grade                -0.77   0.33    -2.31    0.02 *
center × survival             0.08    0.28    0.29     0.76
grade × survival              0.34    0.27    1.25     0.20
center × grade × survival     0.12    0.40    0.29     0.76

The best sub-model, selected using AIC and backward searching is:

Variable                      $\hat\beta_j$   se      $W_j$    p-value
(Intercept)                   3.52    0.13    25.62    < 0.00 ***
center                        0.23    0.13    1.70     0.08
grade                         0.26    0.18    1.43     0.15
survival                      0.56    0.14    3.98     6.65e-05 ***
center × grade                -0.67   0.18    -3.62    0.00 ***
grade × survival              0.37    0.19    1.90     0.05

FIGURE 19.5. The graph for Example 19.15 (vertices: Center, Grade, Survival).

The graph for this model $M$ is shown in Figure 19.5. To test the fit of this
model, we compute the deviance of $M$ which is 0.6. The appropriate $\chi^2$ has
8 − 6 = 2 degrees of freedom. The p-value is $\mathbb{P}(\chi^2_2 > .6) = .74$. So we have no
evidence to suggest that the model is a poor fit. ■
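The deviance quoted in Example 19.15 can be checked by hand for the selected model, whose generator is 1.2 + 2.3 (center and survival are conditionally independent given grade). The sketch below assumes the standard closed-form fitted counts for this model, $\hat m_{ijk} = n_{ij+}\, n_{+jk} / n_{+j+}$, rather than the numerical fit the text describes, and it uses the fact that the upper tail of a $\chi^2_2$ distribution at $d$ is $e^{-d/2}$. The function name is hypothetical.

```python
from math import exp, log

# counts[center][grade][survival]; center: Boston, Glamorgan;
# grade: malignant, benign; survival: died, survived.
counts = [[[35, 59], [47, 112]],
          [[42, 77], [26, 76]]]

def deviance_center_indep_survival_given_grade(n):
    """Deviance of the model 1.2 + 2.3 against the saturated model.

    Assumes the closed-form fitted counts
    m[i][j][k] = n[i][j][+] * n[+][j][k] / n[+][j][+].
    """
    dev = 0.0
    for j in range(2):
        n_j = sum(n[i][j][k] for i in range(2) for k in range(2))
        for i in range(2):
            n_ij = sum(n[i][j])
            for k in range(2):
                n_jk = n[0][j][k] + n[1][j][k]
                fitted = n_ij * n_jk / n_j
                dev += 2.0 * n[i][j][k] * log(n[i][j][k] / fitted)
    return dev

dev = deviance_center_indep_survival_given_grade(counts)
p_value = exp(-dev / 2.0)  # upper tail of a chi-square with 2 degrees of freedom
print(round(dev, 2), round(p_value, 2))
```

The printed values can be compared with the deviance 0.6 and the p-value .74 reported in the example.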


19.6  Bibliographic  Remarks 

For  this  chapter,  I  drew  heavily  on  Whittaker  (1990)  which  is  an  excellent 
text  on  log-linear  models  and  graphical  models.  Some  of  the  exercises  are  from 
Whittaker.  A  classic  reference  on  log-linear  models  is  Bishop  et  al.  (1975). 


19.7 Exercises

1. Solve for the $p_{ij}$'s in terms of the $\beta$'s in Example 19.3.

2.  Prove  Lemma  19.5. 

3.  Prove  Lemma  19.9. 

4. Consider random variables $(X_1, X_2, X_3, X_4)$. Suppose the log-density is

$$\log f(x) = \psi_\emptyset(x) + \psi_{12}(x) + \psi_{13}(x) + \psi_{24}(x) + \psi_{34}(x).$$

(a) Draw the graph $\mathcal{G}$ for these variables.

(b)  Write  down  all  independence  and  conditional  independence  relations 
implied  by  the  graph. 

(c)  Is  this  model  graphical?  Is  it  hierarchical? 

5. Suppose that parameters $p(x_1, x_2, x_3)$ are proportional to the following
values:

           $x_2$   0     0     1     1
           $x_3$   0     1     0     1
$x_1$  0           2     8     4     16
       1           16    128   32    256

Find the $\psi$-terms for the log-linear expansion. Comment on the model.

6. Let $X_1, \ldots, X_4$ be binary. Draw the independence graphs corresponding
to the following log-linear models. Also, identify whether each is
graphical and/or hierarchical (or neither).

(a) $\log f = 7 + 11x_1 + 2x_2 + 1.5x_3 + 17x_4$

(b) $\log f = 7 + 11x_1 + 2x_2 + 1.5x_3 + 17x_4 + 12x_2x_3 + 78x_2x_4 + 3x_3x_4 + 32x_2x_3x_4$

(c) $\log f = 7 + 11x_1 + 2x_2 + 1.5x_3 + 17x_4 + 12x_2x_3 + 3x_3x_4 + x_1x_4 + 2x_1x_2$

(d) $\log f = 7 + 5055x_1x_2x_3x_4$


20 

Nonparametric Curve Estimation


In this chapter we discuss nonparametric estimation of probability density
functions and regression functions, which we refer to as curve estimation or
smoothing.

In Chapter 7 we saw that it is possible to consistently estimate a cumulative
distribution function $F$ without making any assumptions about $F$. If we want
to estimate a probability density function $f(x)$ or a regression function
$r(x) = \mathbb{E}(Y \mid X = x)$ the situation is different. We cannot estimate these functions
consistently without making some smoothness assumptions. Correspondingly,
we need to perform some sort of smoothing operation on the data.

An example of a density estimator is a histogram, which we discuss in
detail in Section 20.2. To form a histogram estimator of a density $f$, we divide
the real line into disjoint sets called bins. The histogram estimator is a piecewise
constant function where the height of the function is proportional to the number
of observations in each bin; see Figure 20.3. The number of bins is an example
of a smoothing parameter. If we smooth too much (large bins) we get a
highly biased estimator while if we smooth too little (small bins) we get a
highly variable estimator. Much of curve estimation is concerned with trying
to optimally balance variance and bias.

FIGURE 20.1. A curve estimate $\hat g$ is random because it is a function of the data.
The point $x$ at which we evaluate $\hat g$ is not a random variable.

20.1 The Bias-Variance Tradeoff


Let $g$ denote an unknown function such as a density function or a regression
function. Let $\hat g_n$ denote an estimator of $g$. Bear in mind that $\hat g_n(x)$ is a random
function evaluated at a point $x$. The estimator is random because it depends
on the data. See Figure 20.1.

As a loss function, we will use the integrated squared error (ISE):¹

$$L(g, \hat g_n) = \int (g(u) - \hat g_n(u))^2\, du. \qquad (20.1)$$

The risk or mean integrated squared error (MISE) with respect to
squared error loss is

$$R(g, \hat g_n) = \mathbb{E}\bigl[ L(g, \hat g_n) \bigr]. \qquad (20.2)$$

20.1 Lemma. The risk can be written as

$$R(g, \hat g_n) = \int b^2(x)\, dx + \int v(x)\, dx \qquad (20.3)$$

where

$$b(x) = \mathbb{E}(\hat g_n(x)) - g(x) \qquad (20.4)$$

is the bias of $\hat g_n(x)$ at a fixed $x$ and

$$v(x) = \mathbb{V}(\hat g_n(x)) = \mathbb{E}\Bigl( \bigl(\hat g_n(x) - \mathbb{E}(\hat g_n(x))\bigr)^2 \Bigr) \qquad (20.5)$$

is the variance of $\hat g_n(x)$ at a fixed $x$.

¹ We could use other loss functions. The results are similar but the analysis is much more
complicated.

FIGURE 20.2. The bias-variance tradeoff. The bias increases and the variance
decreases with the amount of smoothing. The optimal amount of smoothing, indicated
by the vertical line, minimizes the risk = bias² + variance.

In summary,

$$\text{RISK} = \text{BIAS}^2 + \text{VARIANCE}. \qquad (20.6)$$

When the data are oversmoothed, the bias term is large and the variance
is small. When the data are undersmoothed the opposite is true; see Figure
20.2. This is called the bias-variance tradeoff. Minimizing risk corresponds
to balancing bias and variance.
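The decomposition in Lemma 20.1 follows from the usual pointwise decomposition of mean squared error. The derivation below is a sketch that assumes the expectation and the integral can be interchanged; the cross term vanishes because $\mathbb{E}[\hat g_n(x) - \mathbb{E}\hat g_n(x)] = 0$.

```latex
\begin{align*}
\mathbb{E}\bigl[(\hat g_n(x) - g(x))^2\bigr]
  &= \mathbb{E}\bigl[(\hat g_n(x) - \mathbb{E}\hat g_n(x) + \mathbb{E}\hat g_n(x) - g(x))^2\bigr] \\
  &= \mathbb{V}(\hat g_n(x)) + \bigl(\mathbb{E}\hat g_n(x) - g(x)\bigr)^2
   = v(x) + b^2(x), \\
R(g, \hat g_n)
  &= \mathbb{E}\int (g(x) - \hat g_n(x))^2\,dx
   = \int \mathbb{E}\bigl[(\hat g_n(x) - g(x))^2\bigr]\,dx
   = \int b^2(x)\,dx + \int v(x)\,dx .
\end{align*}
```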


20.2  Histograms 

Let $X_1, \ldots, X_n$ be IID on [0, 1] with density $f$. The restriction to [0, 1] is not
crucial; we can always rescale the data to be on this interval. Let $m$ be an
integer and define bins

$$B_1 = \left[0, \tfrac{1}{m}\right),\quad B_2 = \left[\tfrac{1}{m}, \tfrac{2}{m}\right),\quad \ldots,\quad B_m = \left[\tfrac{m-1}{m}, 1\right]. \qquad (20.7)$$

Define the binwidth $h = 1/m$, let $\nu_j$ be the number of observations in $B_j$,
let $\hat p_j = \nu_j / n$ and let $p_j = \int_{B_j} f(u)\, du$.

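The histogram estimator is defined by

$$\hat f_n(x) = \begin{cases} \hat p_1 / h & x \in B_1 \\ \hat p_2 / h & x \in B_2 \\ \ \ \vdots & \ \ \vdots \\ \hat p_m / h & x \in B_m \end{cases}$$

which we can write more succinctly as

$$\hat f_n(x) = \sum_{j=1}^m \frac{\hat p_j}{h}\, I(x \in B_j). \qquad (20.8)$$

To understand the motivation for this estimator, let $p_j = \int_{B_j} f(u)\, du$ and note
that, for $x \in B_j$ and $h$ small,

$$\mathbb{E}(\hat f_n(x)) = \frac{\mathbb{E}(\hat p_j)}{h} = \frac{p_j}{h} = \frac{\int_{B_j} f(u)\, du}{h} \approx \frac{f(x)\, h}{h} = f(x).$$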

20.2 Example. Figure 20.3 shows three different histograms based on $n = 1{,}266$
data points from an astronomical sky survey. Each data point represents
the distance from us to a galaxy. The galaxies lie on a “pencilbeam”
pointing directly from the Earth out into space. Because of the finite speed of
light, looking at galaxies farther and farther away corresponds to looking back
in time. Choosing the right number of bins involves finding a good tradeoff
between bias and variance. We shall see later that the top left histogram has
too few bins resulting in oversmoothing and too much bias. The bottom left
histogram has too many bins resulting in undersmoothing and too much variance.
The top right histogram is just right. The histogram reveals the presence of
clusters of galaxies. Seeing how the size and number of galaxy clusters varies
with time helps cosmologists understand the evolution of the universe. ■


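The mean and variance of $\hat f_n(x)$ are given in the following theorem.

20.3 Theorem. Consider fixed $x$ and fixed $m$, and let $B_j$ be the bin containing
$x$. Then,

$$\mathbb{E}(\hat f_n(x)) = \frac{p_j}{h} \quad \text{and} \quad \mathbb{V}(\hat f_n(x)) = \frac{p_j (1 - p_j)}{n h^2}. \qquad (20.9)$$

FIGURE 20.3. Three versions of a histogram for the astronomy data. The top left
histogram has too few bins. The bottom left histogram has too many bins. The top
right histogram is just right. The lower, right plot shows the estimated risk versus
the number of bins.

Let’s take a closer look at the bias-variance tradeoff using equation (20.9).
Consider some $x \in B_j$. For any other $u \in B_j$,

$$f(u) \approx f(x) + (u - x) f'(x)$$

and so

$$p_j = \int_{B_j} f(u)\, du \approx \int_{B_j} \Bigl( f(x) + (u - x) f'(x) \Bigr) du = f(x)\, h + h f'(x) \left( h \left( j - \tfrac{1}{2} \right) - x \right).$$

Therefore, the bias $b(x)$ is

$$\begin{aligned}
b(x) &= \mathbb{E}(\hat f_n(x)) - f(x) = \frac{p_j}{h} - f(x) \\
&\approx \frac{f(x)\, h + h f'(x) \left( h \left( j - \tfrac{1}{2} \right) - x \right)}{h} - f(x) \\
&= f'(x) \left( h \left( j - \tfrac{1}{2} \right) - x \right).
\end{aligned}$$

If $\tilde x_j$ is the center of the bin, then

$$\int_{B_j} b^2(x)\, dx \approx \int_{B_j} (f'(x))^2 \left( h \left( j - \tfrac{1}{2} \right) - x \right)^2 dx \approx (f'(\tilde x_j))^2 \int_{B_j} \left( h \left( j - \tfrac{1}{2} \right) - x \right)^2 dx = (f'(\tilde x_j))^2\, \frac{h^3}{12}.$$

Therefore,

$$\int_0^1 b^2(x)\, dx = \sum_{j=1}^m \int_{B_j} b^2(x)\, dx \approx \sum_{j=1}^m (f'(\tilde x_j))^2\, \frac{h^3}{12} = \frac{h^2}{12} \sum_{j=1}^m h\, (f'(\tilde x_j))^2 \approx \frac{h^2}{12} \int_0^1 (f'(x))^2\, dx.$$

Note that this increases as a function of $h$. Now consider the variance. For $h$
small, $1 - p_j \approx 1$, so

$$v(x) \approx \frac{p_j}{n h^2} = \frac{f(x)\, h + h f'(x) \left( h \left( j - \tfrac{1}{2} \right) - x \right)}{n h^2} \approx \frac{f(x)}{n h}.$$

Since $f$ integrates to 1, it follows that $\int_0^1 v(x)\, dx \approx 1/(nh)$.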

20.4 Theorem. Suppose that $\int (f'(u))^2\, du < \infty$. Then

$$R(\hat f_n, f) \approx \frac{h^2}{12} \int (f'(u))^2\, du + \frac{1}{nh}. \qquad (20.10)$$

The value $h^*$ that minimizes (20.10) is

$$h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2\, du} \right)^{1/3}. \qquad (20.11)$$

With this choice of binwidth,

$$R(\hat f_n, f) \approx \frac{C}{n^{2/3}} \qquad (20.12)$$

where $C = (3/4)^{2/3} \left( \int (f'(u))^2\, du \right)^{1/3}$.

Theorem 20.4 is quite revealing. We see that with an optimally chosen
binwidth, the MISE decreases to 0 at rate $n^{-2/3}$. By comparison, most parametric
estimators converge at rate $n^{-1}$. The slower rate of convergence is the price
we pay for being nonparametric. The formula for the optimal binwidth $h^*$ is
of theoretical interest but it is not useful in practice since it depends on the
unknown function $f$.
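To see the formulas of Theorem 20.4 at work, it helps to plug in a density for which $\int (f'(u))^2\, du$ is known. The sketch below uses $f(x) = 6x(1 - x)$ on [0, 1] purely as an illustrative assumption: there $f'(x) = 6 - 12x$ and the integral equals 12. The function names are hypothetical.

```python
def optimal_binwidth(n, roughness):
    """h* of (20.11): (6 / (n * roughness))**(1/3), roughness = integral of f'(u)^2."""
    return (6.0 / (n * roughness)) ** (1.0 / 3.0)

def approximate_risk(h, n, roughness):
    """Right-hand side of (20.10): squared-bias term plus variance term."""
    return h ** 2 / 12.0 * roughness + 1.0 / (n * h)

# Illustrative density f(x) = 6x(1 - x) on [0, 1]: f'(x) = 6 - 12x, so the
# roughness is the integral of (6 - 12x)^2 over [0, 1], which equals 12.
n, roughness = 500, 12.0
h_star = optimal_binwidth(n, roughness)
risk_at_h_star = approximate_risk(h_star, n, roughness)

# Constant C of (20.12); C / n**(2/3) should agree with risk_at_h_star.
C = (3.0 / 4.0) ** (2.0 / 3.0) * roughness ** (1.0 / 3.0)
```

With these inputs the optimal binwidth is 0.1 (ten bins), and binwidths on either side of it, such as 0.05 or 0.2, give a larger value of (20.10).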

A practical way to choose the binwidth is to estimate the risk function
and minimize over $h$. Recall that the loss function, which we now write as a
function of $h$, is

$$\begin{aligned}
L(h) &= \int (\hat f_n(x) - f(x))^2\, dx \\
&= \int \hat f_n^2(x)\, dx - 2 \int \hat f_n(x) f(x)\, dx + \int f^2(x)\, dx.
\end{aligned}$$

The last term does not depend on the binwidth $h$ so minimizing the risk is
equivalent to minimizing the expected value of

$$J(h) = \int \hat f_n^2(x)\, dx - 2 \int \hat f_n(x) f(x)\, dx.$$

We shall refer to $\mathbb{E}(J(h))$ as the risk, although it differs from the true risk by
the constant term $\int f^2(x)\, dx$.

20.5 Definition. The cross-validation estimator of risk is

$$\hat J(h) = \int \bigl( \hat f_n(x) \bigr)^2 dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i) \qquad (20.13)$$

where $\hat f_{(-i)}$ is the histogram estimator obtained after removing the $i$th
observation. We refer to $\hat J(h)$ as the cross-validation score or estimated
risk.

20.6 Theorem. The cross-validation estimator is nearly unbiased:

$$\mathbb{E}(\hat J(h)) \approx \mathbb{E}(J(h)).$$

In principle, we need to recompute the histogram $n$ times to compute $\hat J(h)$.
Moreover, this has to be done for all values of $h$. Fortunately, there is a
shortcut formula.


20.7 Theorem. The following identity holds:

$$\hat J(h) = \frac{2}{(n - 1) h} - \frac{n + 1}{(n - 1) h} \sum_{j=1}^m \hat p_j^{\,2}. \qquad (20.14)$$
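The identity (20.14) can be checked numerically against the definition (20.13). In the sketch below, removing $X_i$ from a histogram lowers the count of its own bin by one and the sample size by one, so $\hat f_{(-i)}(X_i) = (\nu_j - 1)/((n - 1)h)$ for $X_i \in B_j$. The function names and the simulated sample are illustrative assumptions.

```python
import random

def bin_counts(data, m):
    """Counts nu_j of observations in B_j for data on [0, 1]."""
    counts = [0] * m
    for x in data:
        counts[min(int(x * m), m - 1)] += 1
    return counts

def cv_score(data, m):
    """Shortcut formula (20.14) for the cross-validation score."""
    n, h = len(data), 1.0 / m
    sum_p2 = sum((c / n) ** 2 for c in bin_counts(data, m))
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum_p2

def cv_score_direct(data, m):
    """Definition (20.13): integral of f_hat^2 minus (2/n) * sum of leave-one-out fits."""
    n, h = len(data), 1.0 / m
    counts = bin_counts(data, m)
    integral = sum((c / (n * h)) ** 2 * h for c in counts)
    # Leave-one-out fit at X_i: its own bin loses one observation, and n drops by one.
    loo = sum((counts[min(int(x * m), m - 1)] - 1) / ((n - 1) * h) for x in data)
    return integral - 2.0 / n * loo

random.seed(1)
sample = [random.betavariate(2, 5) for _ in range(300)]  # illustrative data on [0, 1]
best_m = min(range(1, 51), key=lambda m: cv_score(sample, m))
```

The shortcut needs only the bin counts, so scanning many values of $m$ costs one pass over the data per candidate rather than $n$ refits.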


20.8 Example. We used cross-validation in the astronomy example. The
cross-validation function is quite flat near its minimum. Any $m$ in the range of 73 to
310 is an approximate minimizer but the resulting histogram does not change
much over this range. The histogram in the top right plot in Figure 20.3 was
constructed using $m = 73$ bins. The bottom right plot shows the estimated
risk, or more precisely, $\hat J$, plotted versus the number of bins. ■

Next we want a confidence set for $f$. Suppose $\hat f_n$ is a histogram with $m$ bins
and binwidth $h = 1/m$. We cannot realistically make confidence statements
about the fine details of the true density $f$. Instead, we shall make confidence
statements about $f$ at the resolution of the histogram. To this end, define

$$\bar f_n(x) = \mathbb{E}(\hat f_n(x)) = \frac{p_j}{h} \quad \text{for } x \in B_j \qquad (20.15)$$

where $p_j = \int_{B_j} f(u)\, du$. Think of $\bar f_n(x)$ as a “histogramized” version of $f$.

20.9 Definition. A pair of functions $(\ell_n(x), u_n(x))$ is a $1 - \alpha$ confidence
band (or confidence envelope) if

$$\mathbb{P}\Bigl( \ell_n(x) \le \bar f_n(x) \le u_n(x) \ \text{for all } x \Bigr) \ge 1 - \alpha. \qquad (20.16)$$

20.10 Theorem. Let $m = m(n)$ be the number of bins in the histogram $\hat f_n$.
Assume that $m(n) \to \infty$ and $m(n) \log n / n \to 0$ as $n \to \infty$. Define

$$\ell_n(x) = \left( \max\left\{ \sqrt{\hat f_n(x)} - c,\ 0 \right\} \right)^2, \qquad u_n(x) = \left( \sqrt{\hat f_n(x)} + c \right)^2 \qquad (20.17)$$

where

$$c = \frac{z_{\alpha/(2m)}}{2} \sqrt{\frac{m}{n}}. \qquad (20.18)$$

Then, $(\ell_n(x), u_n(x))$ is an approximate $1 - \alpha$ confidence band.


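Proof. Here is an outline of the proof. From the central limit theorem,
$\hat p_j \approx N(p_j, p_j(1 - p_j)/n)$. By the delta method, $\sqrt{\hat p_j} \approx N(\sqrt{p_j}, 1/(4n))$. Moreover,
it can be shown that the $\sqrt{\hat p_j}$'s are approximately independent. Therefore,

$$2 \sqrt{n} \left( \sqrt{\hat p_j} - \sqrt{p_j} \right) \approx Z_j \qquad (20.19)$$

where $Z_1, \ldots, Z_m \sim N(0, 1)$. Let

$$A = \Bigl\{ \ell_n(x) \le \bar f_n(x) \le u_n(x) \ \text{for all } x \Bigr\} = \left\{ \max_x \left| \sqrt{\hat f_n(x)} - \sqrt{\bar f_n(x)} \right| \le c \right\}.$$

Then,

$$\begin{aligned}
\mathbb{P}(A^c) &= \mathbb{P}\left( \max_x \left| \sqrt{\hat f_n(x)} - \sqrt{\bar f_n(x)} \right| > c \right) = \mathbb{P}\left( \max_j \left| \sqrt{\frac{\hat p_j}{h}} - \sqrt{\frac{p_j}{h}} \right| > c \right) \\
&= \mathbb{P}\left( \max_j 2 \sqrt{n} \left| \sqrt{\hat p_j} - \sqrt{p_j} \right| > z_{\alpha/(2m)} \right) \\
&\approx \mathbb{P}\left( \max_j |Z_j| > z_{\alpha/(2m)} \right) \le \sum_{j=1}^m \mathbb{P}\left( |Z_j| > z_{\alpha/(2m)} \right) \\
&= \sum_{j=1}^m \frac{\alpha}{m} = \alpha. \ ■
\end{aligned}$$

FIGURE 20.4. 95 percent confidence envelope for astronomy data using $m = 73$
bins.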

20.11  Example.  Figure  20.4  shows  a  95  percent  confidence  envelope  for  the 
astronomy  data.  We  see  that  even  with  over  1,000  data  points,  there  is  still 
substantial  uncertainty.  ■ 
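The band of Theorem 20.10 depends on the data only through the bin counts. The sketch below computes it under the assumption that $z_{\alpha/(2m)}$ denotes the upper $\alpha/(2m)$ quantile of a standard Normal; the function name `histogram_band` and the counts are hypothetical and are not the astronomy data.

```python
from math import sqrt
from statistics import NormalDist

def histogram_band(counts, alpha=0.05):
    """Approximate 1 - alpha band of Theorem 20.10 from bin counts on [0, 1]."""
    n, m = sum(counts), len(counts)
    h = 1.0 / m
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * m))  # upper alpha/(2m) quantile
    c = 0.5 * z * sqrt(m / n)
    f_hat = [k / (n * h) for k in counts]
    lower = [max(sqrt(f) - c, 0.0) ** 2 for f in f_hat]
    upper = [(sqrt(f) + c) ** 2 for f in f_hat]
    return f_hat, lower, upper

counts = [12, 40, 75, 48, 25]  # hypothetical bin counts, n = 200
f_hat, lower, upper = histogram_band(counts)
```

Working on the square-root scale is what makes the half-width $c$ the same in every bin, since the delta method gives $\sqrt{\hat p_j}$ a variance of about $1/(4n)$ regardless of $p_j$.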


20.3  Kernel  Density  Estimation 


Histograms  are  discontinuous.  Kernel  density  estimators  are  smoother  and 
they  converge  faster  to  the  true  density  than  histograms. 

Let  Xj_, . . .  ,Xn  denote  the  observed  data,  a  sample  from  /.  In  this  chap¬ 
ter,  a  kernel  is  defined  to  be  any  smooth  function  K  such  that  K(x)  >  0, 
f  K(x)  dx  =  1,  f  xK(x)dx  =  0  and  cf2k  =  f  x2K(x)dx  >  0.  Two  examples  of 
kernels  are  the  Epanechnikov  kernel 


K(x) 


|(1  —  x2/5)/v/5  \x\  <  y/b 

0  otherwise 


(20.20) 


and  the  Gaussian  (Normal)  kernel  K(x) 


(27r)~1^2e~x2//2. 
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FIGURE  20.5.  A  kernel  density  estimator  /.  At  each  point  x,  f(x)  is  the  average 
of  the  kernels  centered  over  the  data  points  Xi.  The  data  points  are  indicated  by 
short  vertical  bars. 


20.12  Definition.  Given  a  kernel  K  and  a  positive  number  h,  called  the 

bandwidth,  the  kernel  density  estimator  is  defined  to  be 


(20.21) 


An example of a kernel density estimator is shown in Figure 20.5. The kernel estimator effectively puts a smoothed-out lump of mass of size $1/n$ over each data point $X_i$. The bandwidth $h$ controls the amount of smoothing. When $h$ is close to 0, $\hat f_n$ consists of a set of spikes, one at each data point. The height of the spikes tends to infinity as $h \to 0$. When $h \to \infty$, $\hat f_n$ tends to a uniform density.
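The estimator in (20.21) takes only a few lines to compute. The following is a minimal sketch in Python with NumPy, assuming a Gaussian kernel; the names `gaussian_kernel` and `kde`, the synthetic Normal sample, and the bandwidth 0.4 are illustrative choices rather than anything prescribed in the text.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Normal density, used as the kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Evaluate f_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h) at each point of x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h      # shape (len(x), n)
    return kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)                # synthetic data, not the astronomy data
sample = rng.normal(size=200)
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, sample, h=0.4)
```

Rerunning the last line with a much smaller or much larger `h` reproduces the spiky and oversmoothed behavior described above.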
20.13 Example. Figure 20.6 shows kernel density estimators for the astronomy data using three different bandwidths. In each case we used a Gaussian kernel. The properly smoothed kernel density estimator in the top right panel shows structure similar to that of the histogram. However, it is easier to see the clusters with the kernel estimator. ■


To  construct  a  kernel  density  estimator,  we  need  to  choose  a  kernel  K  and 
a  bandwidth  h.  It  can  be  shown  theoretically  and  empirically  that  the  choice 
of K is not crucial.² However, the choice of bandwidth h is very important.
As  with  the  histogram,  we  can  make  a  theoretical  statement  about  how  the 
risk  of  the  estimator  depends  on  the  bandwidth. 


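20.14 Theorem. Under weak assumptions on $f$ and $K$,

$$R(f, \hat f_n) \approx \frac{1}{4}\sigma_K^4 h^4 \int \left(f''(x)\right)^2 dx + \frac{\int K^2(x)\,dx}{nh} \tag{20.22}$$

where $\sigma_K^2 = \int x^2K(x)\,dx$. The optimal bandwidth is

$$h^* = \frac{c_1^{-2/5}\,c_2^{1/5}\,c_3^{-1/5}}{n^{1/5}} \tag{20.23}$$

where $c_1 = \int x^2K(x)\,dx$, $c_2 = \int K(x)^2\,dx$ and $c_3 = \int (f''(x))^2\,dx$. With this choice of bandwidth,

$$R(f, \hat f_n) \approx \frac{c_4}{n^{4/5}}$$

for some constant $c_4 > 0$.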
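The bandwidth (20.23) comes from minimizing the right-hand side of (20.22) over $h$. A short derivation, using only the constants $c_1$, $c_2$, $c_3$ of the theorem (so that $\sigma_K^4 = c_1^2$), makes the constant $c_4$ explicit:

$$
\begin{aligned}
\frac{d}{dh}\left[\frac{1}{4}c_1^2 c_3 h^4 + \frac{c_2}{nh}\right] &= c_1^2 c_3 h^3 - \frac{c_2}{nh^2} = 0
\quad\Longrightarrow\quad h^5 = \frac{c_2}{c_1^2 c_3 n}, \\
R(f, \hat f_n)\Big|_{h = h^*} &\approx \frac{1}{4}c_1^{2/5}c_2^{4/5}c_3^{1/5}n^{-4/5} + c_1^{2/5}c_2^{4/5}c_3^{1/5}n^{-4/5}
= \frac{5}{4}\,c_1^{2/5}c_2^{4/5}c_3^{1/5}\,n^{-4/5}.
\end{aligned}
$$

Under this approximation, $c_4 = \frac{5}{4}c_1^{2/5}c_2^{4/5}c_3^{1/5}$.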


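PROOF. Write $K_h(x, X) = h^{-1}K((x - X)/h)$ and $\hat f_n(x) = n^{-1}\sum_i K_h(x, X_i)$. Thus, $E[\hat f_n(x)] = E[K_h(x, X)]$ and $V[\hat f_n(x)] = n^{-1}V[K_h(x, X)]$. Now,

$$
\begin{aligned}
E[K_h(x, X)] &= \int \frac{1}{h}K\!\left(\frac{x - t}{h}\right)f(t)\,dt \\
&= \int K(u)f(x - hu)\,du \\
&= \int K(u)\left[f(x) - huf'(x) + \frac{h^2u^2}{2}f''(x) + \cdots\right]du \\
&= f(x) + \frac{1}{2}h^2f''(x)\int u^2K(u)\,du + \cdots
\end{aligned}
$$

since $\int K(x)\,dx = 1$ and $\int xK(x)\,dx = 0$. The bias is

$$E[K_h(x, X)] - f(x) \approx \frac{1}{2}\sigma_K^2h^2f''(x).$$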


² It can be shown that the Epanechnikov kernel is optimal in the sense of giving smallest asymptotic mean squared error, but it is really the choice of bandwidth which is crucial.

FIGURE 20.6. Kernel density estimators and estimated risk for the astronomy data. Top left: oversmoothed. Top right: just right (bandwidth chosen by cross-validation). Bottom left: undersmoothed. Bottom right: cross-validation curve as a function of bandwidth $h$. The bandwidth was chosen to be the value of $h$ where the curve is a minimum.

By a similar calculation,

$$V[\hat f_n(x)] \approx \frac{f(x)\int K^2(x)\,dx}{nh_n}.$$

The result follows from integrating the squared bias plus the variance. ■

We see that kernel estimators converge at rate $n^{-4/5}$ while histograms converge at the slower rate $n^{-2/3}$. It can be shown that, under weak assumptions, there does not exist a nonparametric estimator that converges faster than $n^{-4/5}$.

The expression for $h^*$ depends on the unknown density $f$, which makes the result of little practical use. As with the histograms, we shall use cross-validation to find a bandwidth. Thus, we estimate the risk (up to a constant) by

$$\hat J(h) = \int \hat f^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n \hat f_{-i}(X_i) \tag{20.24}$$

where $\hat f_{-i}$ is the kernel density estimator after omitting the $i$th observation.


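20.15 Theorem. For any $h > 0$,

$$E\left[\hat J(h)\right] = E\left[J(h)\right].$$

Also,

$$\hat J(h) \approx \frac{1}{hn^2}\sum_i\sum_j K^*\!\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh}K(0) \tag{20.25}$$

where $K^*(x) = K^{(2)}(x) - 2K(x)$ and $K^{(2)}(z) = \int K(z - y)K(y)\,dy$. In particular, if $K$ is a $N(0,1)$ Gaussian kernel then $K^{(2)}(z)$ is the $N(0,2)$ density.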
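As an illustration of (20.25), the sketch below evaluates the approximate cross-validation score for a Gaussian kernel and picks the minimizer over a grid of candidate bandwidths. It is a sketch under stated assumptions: the function names `phi`, `cv_score` and `cv_bandwidth`, the synthetic sample, and the candidate grid are all hypothetical, and the direct double sum costs $O(n^2)$ per bandwidth.

```python
import numpy as np

def phi(u, sd=1.0):
    """Normal density with mean 0 and standard deviation sd."""
    return np.exp(-0.5 * (u / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def cv_score(data, h):
    """Approximate J_hat(h) from (20.25) for the N(0,1) kernel."""
    data = np.asarray(data, dtype=float)
    n = data.size
    u = (data[:, None] - data[None, :]) / h
    # K* = K^(2) - 2K, where K^(2) is the N(0, 2) density (variance 2).
    k_star = phi(u, sd=np.sqrt(2.0)) - 2.0 * phi(u)
    return k_star.sum() / (h * n**2) + 2.0 * phi(0.0) / (n * h)

def cv_bandwidth(data, candidates):
    """Return the candidate bandwidth with the smallest score, plus all scores."""
    scores = np.array([cv_score(data, h) for h in candidates])
    return candidates[int(np.argmin(scores))], scores

rng = np.random.default_rng(1)                 # synthetic data for illustration
sample = rng.normal(size=300)
candidates = np.linspace(0.05, 1.5, 30)
h_best, scores = cv_bandwidth(sample, candidates)
```

Plotting `scores` against `candidates` gives a curve of the kind shown in the bottom right panel of Figure 20.6.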


We then choose the bandwidth $\hat h_n$ that minimizes $\hat J(h)$.³ A justification for this method is given by the following remarkable theorem due to Stone.

20.16 Theorem (Stone's Theorem). Suppose that $f$ is bounded. Let $\hat f_h$ denote the kernel estimator with bandwidth $h$ and let $\hat h_n$ denote the bandwidth chosen by cross-validation. Then,

$$\frac{\int \left(f(x) - \hat f_{\hat h_n}(x)\right)^2 dx}{\inf_h \int \left(f(x) - \hat f_h(x)\right)^2 dx} \xrightarrow{P} 1. \tag{20.26}$$

³ For large data sets, $\hat f$ and (20.25) can be computed quickly using the fast Fourier transform.

20.17 Example. The top right panel of Figure 20.6 is based on cross-validation. These data are rounded, which creates problems for cross-validation. Specifically, it causes the minimizer to be $h = 0$. To overcome this problem, we added a small amount of random Normal noise to the data. The result is that $\hat J(h)$ is very smooth with a well defined minimum. ■

20.18 Remark. Do not assume that, if the estimator $\hat f$ is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.


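To construct confidence bands, we use something similar to histograms. Again, the confidence band is for the smoothed version,

$$\bar f_n(x) = E(\hat f_n(x)) = \int \frac{1}{h}K\!\left(\frac{x - u}{h}\right)f(u)\,du,$$

of the true density $f$.⁴ Assume the density is on an interval $(a, b)$. The band is

$$\ell_n(x) = \hat f_n(x) - q\,\mathrm{se}(x), \qquad u_n(x) = \hat f_n(x) + q\,\mathrm{se}(x) \tag{20.27}$$

where

$$
\begin{aligned}
\mathrm{se}(x) &= \frac{s(x)}{\sqrt{n}}, \\
s^2(x) &= \frac{1}{n-1}\sum_{i=1}^n \left(Y_i(x) - \bar Y_n(x)\right)^2, \\
Y_i(x) &= \frac{1}{h}K\!\left(\frac{x - X_i}{h}\right), \\
q &= \Phi^{-1}\!\left(\frac{1 + (1 - \alpha)^{1/m}}{2}\right), \\
m &= \frac{b - a}{\omega}
\end{aligned}
$$

where $\omega$ is the width of the kernel. In case the kernel does not have finite width then we take $\omega$ to be the effective width, that is, the range over which the kernel is non-negligible. In particular, we take $\omega = 3h$ for the Normal kernel.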
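The band (20.27) can be sketched as follows for a Gaussian kernel, taking $\omega = 3h$ as suggested above. The function name `kde_band`, the synthetic sample, and the choice of $(a, b)$ as the sample range are assumptions made for illustration only.

```python
import numpy as np
from statistics import NormalDist

def kde_band(x, data, h, a, b, alpha=0.05):
    """Return (lower, f_hat, upper) for the band (20.27) with a Gaussian kernel."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    n = data.size
    u = (x[:, None] - data[None, :]) / h
    y = np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))   # Y_i(x), one row per x
    f_hat = y.mean(axis=1)
    se = y.std(axis=1, ddof=1) / np.sqrt(n)                # s(x) / sqrt(n)
    omega = 3.0 * h                                        # effective kernel width
    m = (b - a) / omega
    q = NormalDist().inv_cdf((1.0 + (1.0 - alpha) ** (1.0 / m)) / 2.0)
    return f_hat - q * se, f_hat, f_hat + q * se

rng = np.random.default_rng(0)                 # synthetic data for illustration
sample = rng.normal(size=200)
grid = np.linspace(-3.0, 3.0, 61)
lower, f_hat, upper = kde_band(grid, sample, h=0.4, a=sample.min(), b=sample.max())
```

When $m = 1$ the multiplier $q$ reduces to the usual two-sided Normal quantile; larger $m$ widens the band to account for the number of roughly independent kernel windows in $(a, b)$.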


20.19 Example. Figure 20.7 shows approximate 95 percent confidence bands for the astronomy data. ■

⁴ This is a modified version of the band described in Chaudhuri and Marron (1999).

FIGURE 20.7. 95 percent confidence bands for kernel density estimate for the astronomy data.


Suppose now that the data $X_i = (X_{i1}, \ldots, X_{id})$ are $d$-dimensional. The kernel estimator can easily be generalized to $d$ dimensions. Let $h = (h_1, \ldots, h_d)$ be a vector of bandwidths and define

$$\hat f_n(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i) \tag{20.28}$$

where

$$K_h(x - X_i) = \frac{1}{h_1\cdots h_d}\prod_{j=1}^d K\!\left(\frac{x_j - X_{ij}}{h_j}\right). \tag{20.29}$$

For simplicity, we might take $h_j = s_jh$ where $s_j$ is the standard deviation of the $j$th variable. There is now only a single bandwidth $h$ to choose. Using calculations like those in the one-dimensional case, the risk is given by

$$R(f, \hat f_n) \approx \frac{1}{4}\sigma_K^4\left[\sum_{j=1}^d h_j^4\int f_{jj}^2(x)\,dx + \sum_{j\ne k}h_j^2h_k^2\int f_{jj}f_{kk}\,dx\right] + \frac{\left(\int K^2(x)\,dx\right)^d}{nh_1\cdots h_d}$$

where $f_{jj}$ is the second partial derivative of $f$. The optimal bandwidth satisfies $h_i \approx c_1n^{-1/(4+d)}$, leading to a risk of order $n^{-4/(4+d)}$. From this fact, we see




that the risk increases quickly with dimension, a problem usually called the curse of dimensionality. To get a sense of how serious this problem is, consider the following table from Silverman (1986) which shows the sample size required to ensure a relative mean squared error less than 0.1 at 0 when the density is multivariate normal and the optimal bandwidth is selected:

Dimension    Sample Size
    1                  4
    2                 19
    3                 67
    4                223
    5                768
    6              2,790
    7             10,700
    8             43,700
    9            187,000
   10            842,000

This is bad news indeed. It says that having 842,000 observations in a ten-dimensional problem is really like having 4 observations in a one-dimensional problem.

20.4 Nonparametric Regression

Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by

$$Y_i = r(x_i) + \epsilon_i \tag{20.30}$$

where $E(\epsilon_i) = 0$. We have written the $x_i$'s in lower case since we will treat them as fixed. We can do this since, in regression, it is only the mean of $Y$ conditional on $x$ that we are interested in. We want to estimate the regression function $r(x) = E(Y \mid X = x)$.

There are many nonparametric regression estimators. Most involve estimating $r(x)$ by taking some sort of weighted average of the $Y_i$'s, giving higher weight to those points near $x$. A popular version is the Nadaraya-Watson kernel estimator.

20.20 Definition. The Nadaraya-Watson kernel estimator is defined by

$$\hat r(x) = \sum_{i=1}^n w_i(x)Y_i \tag{20.31}$$

where $K$ is a kernel and the weights $w_i(x)$ are given by

$$w_i(x) = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x - x_j}{h}\right)}. \tag{20.32}$$

The form of this estimator comes from first estimating the joint density $f(x, y)$ using kernel density estimation and then inserting the estimate into the formula,

$$r(x) = E(Y \mid X = x) = \int y f(y \mid x)\,dy = \frac{\int y f(x, y)\,dy}{\int f(x, y)\,dy}.$$

20.21 Theorem. Suppose that $V(\epsilon_i) = \sigma^2$. The risk of the Nadaraya-Watson kernel estimator is

$$R(\hat r_n, r) \approx \frac{h^4}{4}\left(\int x^2K(x)\,dx\right)^2\int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2dx + \int\frac{\sigma^2\int K^2(x)\,dx}{nhf(x)}\,dx. \tag{20.33}$$

The optimal bandwidth decreases at rate $n^{-1/5}$ and with this choice the risk decreases at rate $n^{-4/5}$.


In practice, to choose the bandwidth $h$ we minimize the cross-validation score

$$\hat J(h) = \sum_{i=1}^n\left(Y_i - \hat r_{-i}(x_i)\right)^2 \tag{20.34}$$

where $\hat r_{-i}$ is the estimator we get by omitting the $i$th observation. Fortunately, there is a shortcut formula for computing $\hat J$.

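20.22 Theorem. $\hat J$ can be written as

$$\hat J(h) = \sum_{i=1}^n\left(Y_i - \hat r(x_i)\right)^2\frac{1}{\left(1 - \dfrac{K(0)}{\sum_{j=1}^n K\!\left(\frac{x_i - x_j}{h}\right)}\right)^2}. \tag{20.35}$$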
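The shortcut (20.35) can be checked numerically against the leave-one-out definition (20.34). The sketch below does so for a Gaussian kernel on synthetic data; the function names and the data-generating model are assumptions chosen for illustration.

```python
import numpy as np

def nw_fit_matrix(x_obs, h):
    """Matrix W with W[i, j] = w_j(x_i) for the Gaussian-kernel smoother."""
    u = (x_obs[:, None] - x_obs[None, :]) / h
    k = np.exp(-0.5 * u**2)
    return k / k.sum(axis=1, keepdims=True)

def cv_shortcut(x_obs, y_obs, h):
    """J_hat(h) via (20.35); diag(W)[i] equals K(0) / sum_j K((x_i - x_j)/h)."""
    w = nw_fit_matrix(x_obs, h)
    resid = y_obs - w @ y_obs
    return np.sum((resid / (1.0 - np.diag(w))) ** 2)

def cv_brute_force(x_obs, y_obs, h):
    """J_hat(h) via (20.34): refit n times, each time leaving out one pair."""
    n = x_obs.size
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        k = np.exp(-0.5 * ((x_obs[i] - x_obs[keep]) / h) ** 2)
        total += (y_obs[i] - np.sum(k * y_obs[keep]) / np.sum(k)) ** 2
    return total

rng = np.random.default_rng(4)       # synthetic data for illustration
x_obs = np.sort(rng.uniform(size=60))
y_obs = np.sin(2.0 * np.pi * x_obs) + rng.normal(scale=0.2, size=60)
shortcut = cv_shortcut(x_obs, y_obs, h=0.1)
brute = cv_brute_force(x_obs, y_obs, h=0.1)
```

The two values agree up to rounding, while the shortcut needs only a single fit.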


20.23 Example. Figure 20.8 shows cosmic microwave background (CMB) data from BOOMERaNG (Netterfield et al. (2002)), Maxima (Lee et al. (2001)), and DASI (Halverson et al. (2002)). The data consist of $n$ pairs $(x_1, Y_1), \ldots, (x_n, Y_n)$ where $x_i$ is called the multipole moment and $Y_i$ is the estimated power spectrum of the temperature fluctuations. What you are seeing are sound waves in the cosmic microwave background radiation, which is the heat left over from the big bang. If $r(x)$ denotes the true power spectrum, then

$$Y_i = r(x_i) + \epsilon_i$$

where $\epsilon_i$ is a random error with mean 0. The location and size of peaks in $r(x)$ provide valuable clues about the behavior of the early universe. Figure 20.8 shows the fit based on cross-validation as well as an undersmoothed and oversmoothed fit. The cross-validation fit shows the presence of three well-defined peaks, as predicted by the physics of the big bang. ■

The procedure for finding confidence bands is similar to that for density estimation. However, we first need to estimate $\sigma^2$. Suppose that the $x_i$'s are ordered. Assuming $r(x)$ is smooth, we have $r(x_{i+1}) - r(x_i) \approx 0$ and hence

$$Y_{i+1} - Y_i = \left[r(x_{i+1}) + \epsilon_{i+1}\right] - \left[r(x_i) + \epsilon_i\right] \approx \epsilon_{i+1} - \epsilon_i$$

and hence

$$V(Y_{i+1} - Y_i) \approx V(\epsilon_{i+1} - \epsilon_i) = V(\epsilon_{i+1}) + V(\epsilon_i) = 2\sigma^2.$$

We can thus use half the average of the $n - 1$ squared differences $(Y_{i+1} - Y_i)^2$ to estimate $\sigma^2$. Hence, define

$$\hat\sigma^2 = \frac{1}{2(n-1)}\sum_{i=1}^{n-1}(Y_{i+1} - Y_i)^2. \tag{20.36}$$

As with density estimation, the confidence band is for the smoothed version $\bar r_n(x) = E(\hat r_n(x))$ of the true regression function $r$.
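The difference-based estimator (20.36) is a one-liner once the data are ordered by $x$. A minimal sketch, with the hypothetical name `diff_variance`:

```python
import numpy as np

def diff_variance(x_obs, y_obs):
    """sigma_hat^2 = sum_i (Y_{i+1} - Y_i)^2 / (2 (n - 1)), with Y ordered by x."""
    order = np.argsort(x_obs)                         # make sure the x_i's are ordered
    d = np.diff(np.asarray(y_obs, dtype=float)[order])
    return np.sum(d**2) / (2.0 * d.size)              # d.size equals n - 1

sigma2_hat = diff_variance([0.0, 1.0, 2.0], [1.0, 3.0, 2.0])   # (4 + 1) / 4 = 1.25
```

The estimator relies on $r$ changing little between neighboring design points, so it tends to overestimate $\sigma^2$ when the design is sparse relative to the wiggliness of $r$.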
FIGURE 20.8. Regression analysis of the CMB data. The first fit is undersmoothed, the second is oversmoothed, and the third is based on cross-validation. The last panel shows the estimated risk versus the bandwidth of the smoother. The data are from BOOMERaNG, Maxima, and DASI.

Confidence Bands for Kernel Regression

An approximate $1 - \alpha$ confidence band for $\bar r_n(x)$ is

$$\ell_n(x) = \hat r_n(x) - q\,\mathrm{se}(x), \qquad u_n(x) = \hat r_n(x) + q\,\mathrm{se}(x) \tag{20.37}$$

where

$$\mathrm{se}(x) = \hat\sigma\sqrt{\sum_{i=1}^n w_i^2(x)}, \qquad q = \Phi^{-1}\!\left(\frac{1 + (1 - \alpha)^{1/m}}{2}\right), \qquad m = \frac{b - a}{\omega},$$

$\hat\sigma$ is defined in (20.36) and $\omega$ is the width of the kernel. In case the kernel does not have finite width then we take $\omega$ to be the effective width, that is, the range over which the kernel is non-negligible. In particular, we take $\omega = 3h$ for the Normal kernel.

20.24  Example.  Figure  20.9  shows  a  95  percent  confidence  envelope  for  the 
CMB  data.  We  see  that  we  are  highly  confident  of  the  existence  and  position 
of  the  first  peak.  We  are  more  uncertain  about  the  second  and  third  peak. 
At  the  time  of  this  writing,  more  accurate  data  are  becoming  available  that 
apparently  provide  sharper  estimates  of  the  second  and  third  peak.  ■ 

The extension to multiple regressors $X = (X_1, \ldots, X_p)$ is straightforward. As with kernel density estimation we just replace the kernel with a multivariate kernel. However, the same caveats about the curse of dimensionality apply. In some cases, we might consider putting some restrictions on the regression function which will then reduce the curse of dimensionality. For example, additive regression is based on the model

$$Y = \sum_{j=1}^p r_j(X_j) + \epsilon. \tag{20.38}$$

Now we only need to fit $p$ one-dimensional functions. The model can be enriched by adding various interactions, for example,

$$Y = \sum_{j=1}^p r_j(X_j) + \sum_{j<k}r_{jk}(X_jX_k) + \epsilon. \tag{20.39}$$

Additive models are usually fit by an algorithm called backfitting.

FIGURE 20.9. 95 percent confidence envelope for the CMB data.

Backfitting

1. Initialize $r_1(x_1), \ldots, r_p(x_p)$.

2. For $j = 1, \ldots, p$:

(a) Let $\epsilon_i = Y_i - \sum_{s\ne j}r_s(x_i)$.

(b) Let $r_j$ be the function estimate obtained by regressing the $\epsilon_i$'s on the $j$th covariate.

3. If converged STOP. Else, go back to step 2.
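The backfitting steps can be sketched as below, using a Gaussian-kernel Nadaraya-Watson smoother for step 2(b). Several details are assumptions added for the sketch and are not part of the boxed algorithm: the components start at zero, an intercept equal to the mean of $Y$ is removed, each fitted component is centered so the decomposition is identifiable, a single bandwidth `h` is shared by all covariates, and convergence is declared when the fitted values change by less than `tol`.

```python
import numpy as np

def smooth(x, target, h):
    """Nadaraya-Watson fit of target on x (Gaussian kernel), at the observed x."""
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2)
    return (k @ target) / k.sum(axis=1)

def backfit(X, y, h, max_iter=100, tol=1e-8):
    """Fit Y = intercept + sum_j r_j(X_j) + error; column j of fitted is r_j(x_ij)."""
    n, p = X.shape
    intercept = y.mean()
    fitted = np.zeros((n, p))                          # step 1: initialize
    for _ in range(max_iter):
        previous = fitted.copy()
        for j in range(p):                             # step 2
            # step 2(a): partial residuals, leaving out the j-th component
            partial = y - intercept - fitted.sum(axis=1) + fitted[:, j]
            # step 2(b): regress the partial residuals on the j-th covariate
            fitted[:, j] = smooth(X[:, j], partial, h)
            fitted[:, j] -= fitted[:, j].mean()        # centering (an assumption)
        if np.max(np.abs(fitted - previous)) < tol:    # step 3
            break
    return intercept, fitted

rng = np.random.default_rng(0)                         # synthetic additive data
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
intercept, fitted = backfit(X, y, h=0.15)
```

Each pass costs $p$ one-dimensional smooths, which is why additive models can be fit quickly even when $p$ is moderately large.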


Additive models have the advantage that they avoid the curse of dimensionality and they can be fit quickly, but they have one disadvantage: the model is not fully nonparametric. In other words, the true regression function $r(x)$ may not be of the form (20.38).


20.5  Appendix 


Confidence Sets and Bias. The confidence bands we computed are not for the density function or regression function but rather for the smoothed function. For example, the confidence band for a kernel density estimate with bandwidth $h$ is a band for the function one gets by smoothing the true function with a kernel with the same bandwidth. Getting a confidence set for the true function is complicated for reasons we now explain.

Let $\hat f_n(x)$ denote an estimate of the function $f(x)$. Denote the mean and standard deviation of $\hat f_n(x)$ by $\bar f_n(x)$ and $s_n(x)$. Then,

$$\frac{\hat f_n(x) - f(x)}{s_n(x)} = \frac{\hat f_n(x) - \bar f_n(x)}{s_n(x)} + \frac{\bar f_n(x) - f(x)}{s_n(x)}.$$

Typically, the first term converges to a standard Normal from which one derives confidence bands. The second term is the bias divided by the standard deviation. In parametric inference, the bias is usually smaller than the standard deviation of the estimator so this term goes to 0 as the sample size increases. In nonparametric inference, optimal smoothing leads us to balance the bias and the standard deviation. Thus the second term does not vanish even with large sample sizes. This means that the confidence interval will not be centered around the true function $f$.

20.6  Bibliographic  Remarks 

Two  very  good  books  on  density  estimation  are  Scott  (1992)  and  Silverman 
(1986).  The  literature  on  nonparametric  regression  is  very  large.  Two  good 
starting points are Härdle (1990) and Loader (1999). The latter emphasizes a
class  of  techniques  called  local  likelihood  methods. 


20.7  Exercises 


1. Let $X_1, \ldots, X_n \sim f$ and let $\hat f_n$ be the kernel density estimator using the boxcar kernel:

$$K(x) = \begin{cases} 1 & -\frac{1}{2} < x < \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

(a) Show that

$$E(\hat f(x)) = \frac{1}{h}\int_{x-(h/2)}^{x+(h/2)}f(y)\,dy$$

and

$$V(\hat f(x)) = \frac{1}{nh^2}\left[\int_{x-(h/2)}^{x+(h/2)}f(y)\,dy - \left(\int_{x-(h/2)}^{x+(h/2)}f(y)\,dy\right)^2\right].$$

(b) Show that if $h \to 0$ and $nh \to \infty$ as $n \to \infty$, then $\hat f_n(x) \xrightarrow{P} f(x)$.

2. Get the data on fragments of glass collected in forensic work from the book website. Estimate the density of the first variable (refractive index) using a histogram and use a kernel density estimator. Use cross-validation to choose the amount of smoothing. Experiment with different binwidths and bandwidths. Comment on the similarities and differences. Construct 95 percent confidence bands for your estimators.

3. Consider the data from question 2. Let $Y$ be refractive index and let $x$ be aluminum content (the fourth variable). Perform a nonparametric regression to fit the model $Y = f(x) + \epsilon$. Use cross-validation to estimate the bandwidth. Construct 95 percent confidence bands for your estimate.

4.  Prove  Lemma  20.1. 

5.  Prove  Theorem  20.3. 

6.  Prove  Theorem  20.7. 

7.  Prove  Theorem  20.15. 

8. Consider regression data $(x_1, Y_1), \ldots, (x_n, Y_n)$. Suppose that $0 \le x_i \le 1$ for all $i$. Define bins $B_j$ as in equation (20.7). For $x \in B_j$ define

$$\hat r_n(x) = \bar Y_j$$

where $\bar Y_j$ is the mean of all the $Y_i$'s corresponding to those $x_i$'s in $B_j$. Find the approximate risk of this estimator. From this expression for the risk, find the optimal bandwidth. At what rate does the risk go to zero?

9. Show that with suitable smoothness assumptions on $r(x)$, $\hat\sigma^2$ in equation (20.36) is a consistent estimator of $\sigma^2$.


10.  Prove  Theorem  20.22. 


21 

Smoothing  Using  Orthogonal  Functions 


In this chapter we will study an approach to nonparametric curve estimation based on orthogonal functions. We begin with a brief introduction to the theory of orthogonal functions, then we turn to density estimation and regression.


21.1 Orthogonal Functions and $L_2$ Spaces

Let $v = (v_1, v_2, v_3)$ denote a three-dimensional vector, that is, a list of three real numbers. Let $V$ denote the set of all such vectors. If $a$ is a scalar (a number) and $v$ is a vector, we define $av = (av_1, av_2, av_3)$. The sum of vectors $v$ and $w$ is defined by $v + w = (v_1 + w_1, v_2 + w_2, v_3 + w_3)$. The inner product between two vectors $v$ and $w$ is defined by $\langle v, w\rangle = \sum_{i=1}^3 v_iw_i$. The norm (or length) of a vector $v$ is defined by

$$\|v\| = \sqrt{\langle v, v\rangle} = \sqrt{\sum_{i=1}^3 v_i^2}. \tag{21.1}$$

Two vectors are orthogonal (or perpendicular) if $\langle v, w\rangle = 0$. A set of vectors is orthogonal if each pair in the set is orthogonal. A vector is normal if $\|v\| = 1$.

Let $\phi_1 = (1, 0, 0)$, $\phi_2 = (0, 1, 0)$, $\phi_3 = (0, 0, 1)$. These vectors are said to be an orthonormal basis for $V$ since they have the following properties:

(i) they are orthogonal;

(ii) they are normal;

(iii) they form a basis for $V$, which means that any $v \in V$ can be written as a linear combination of $\phi_1, \phi_2, \phi_3$:

$$v = \sum_{j=1}^3 \beta_j\phi_j \quad\text{where}\quad \beta_j = \langle\phi_j, v\rangle. \tag{21.2}$$

For example, if $v = (12, 3, 4)$ then $v = 12\phi_1 + 3\phi_2 + 4\phi_3$. There are other orthonormal bases for $V$, for example,

$$\psi_1 = \left(\frac{1}{\sqrt 3}, \frac{1}{\sqrt 3}, \frac{1}{\sqrt 3}\right), \quad \psi_2 = \left(\frac{1}{\sqrt 2}, -\frac{1}{\sqrt 2}, 0\right), \quad \psi_3 = \left(\frac{1}{\sqrt 6}, \frac{1}{\sqrt 6}, -\frac{2}{\sqrt 6}\right).$$

You can check that these three vectors also form an orthonormal basis for $V$. Again, if $v$ is any vector then we can write

$$v = \sum_{j=1}^3 \beta_j\psi_j \quad\text{where}\quad \beta_j = \langle\psi_j, v\rangle.$$

For example, if $v = (12, 3, 4)$ then

$$v = 10.97\psi_1 + 6.36\psi_2 + 2.86\psi_3.$$
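The claims about the second basis are easy to verify numerically. The following sketch (variable names are illustrative) checks orthonormality, computes the coefficients $\beta_j = \langle\psi_j, v\rangle$ for $v = (12, 3, 4)$, and rebuilds $v$ from them.

```python
import numpy as np

# Rows are the basis vectors psi_1, psi_2, psi_3.
psi = np.array([
    np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0),
    np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0),
    np.array([1.0, 1.0, -2.0]) / np.sqrt(6.0),
])
v = np.array([12.0, 3.0, 4.0])

gram = psi @ psi.T           # identity matrix if the basis is orthonormal
beta = psi @ v               # beta_j = <psi_j, v>, approximately (10.97, 6.36, 2.86)
reconstructed = beta @ psi   # sum_j beta_j psi_j
```

The squared coefficients sum to $\|v\|^2 = 169$, the finite-dimensional analogue of Parseval's relation.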


Now we make the leap from vectors to functions. Basically, we just replace vectors with functions and sums with integrals. Let $L_2(a, b)$ denote all functions defined on the interval $[a, b]$ such that $\int_a^b f(x)^2\,dx < \infty$:

$$L_2(a, b) = \left\{f : [a, b] \to \mathbb{R},\ \int_a^b f(x)^2\,dx < \infty\right\}. \tag{21.3}$$

We sometimes write $L_2$ instead of $L_2(a, b)$. The inner product between two functions $f, g \in L_2$ is defined by $\int f(x)g(x)\,dx$. The norm of $f$ is

$$\|f\| = \sqrt{\int f(x)^2\,dx}. \tag{21.4}$$

Two functions are orthogonal if $\int f(x)g(x)\,dx = 0$. A function is normal if $\|f\| = 1$.

A sequence of functions $\phi_1, \phi_2, \phi_3, \ldots$ is orthonormal if $\int \phi_j^2(x)\,dx = 1$ for each $j$ and $\int \phi_i(x)\phi_j(x)\,dx = 0$ for $i \ne j$. An orthonormal sequence is complete if the only function that is orthogonal to each $\phi_j$ is the zero function. In this case, the functions $\phi_1, \phi_2, \phi_3, \ldots$ form a basis, meaning that if $f \in L_2$ then $f$ can be written as¹

$$f(x) = \sum_{j=1}^\infty \beta_j\phi_j(x), \quad\text{where}\quad \beta_j = \int_a^b f(x)\phi_j(x)\,dx. \tag{21.5}$$

A useful result is Parseval's relation which says that

$$\|f\|^2 \equiv \int f^2(x)\,dx = \sum_{j=1}^\infty \beta_j^2 \equiv \|\beta\|^2 \tag{21.6}$$

where $\beta = (\beta_1, \beta_2, \ldots)$.

21.1 Example. An example of an orthonormal basis for $L_2(0, 1)$ is the cosine basis defined as follows. Let $\phi_0(x) = 1$ and for $j \ge 1$ define

$$\phi_j(x) = \sqrt{2}\cos(j\pi x). \tag{21.7}$$

The first six functions are plotted in Figure 21.1.


21.2 Example. Let

$$f(x) = \sqrt{x(1 - x)}\,\sin\!\left(\frac{2.1\pi}{x + .05}\right)$$

which is called the "doppler function." Figure 21.2 shows $f$ (top left) and its approximation

$$f_J(x) = \sum_{j=1}^J \beta_j\phi_j(x)$$

with $J$ equal to 5 (top right), 20 (bottom left), and 200 (bottom right). As $J$ increases we see that $f_J(x)$ gets closer to $f(x)$. The coefficients $\beta_j = \int_0^1 f(x)\phi_j(x)\,dx$ were computed numerically. ■
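The approximation in Example 21.2 can be reproduced with a few lines of numerical integration. The sketch below is one way to do it, under assumptions that are not in the text: the integrals are approximated by the midpoint rule on 20,000 points, and the constant term $\phi_0$ of the cosine basis (21.7) is included in every partial sum.

```python
import numpy as np

def doppler(x):
    return np.sqrt(x * (1.0 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

def cosine_basis(j, x):
    """phi_0(x) = 1 and phi_j(x) = sqrt(2) cos(j pi x) for j >= 1."""
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(j * np.pi * x)

m = 20000
x = (np.arange(m) + 0.5) / m          # midpoints of m equal cells on [0, 1]
f = doppler(x)

def partial_sum(J):
    """f_J on the grid, with beta_j approximated by the midpoint rule."""
    approx = np.zeros_like(x)
    for j in range(J + 1):
        phi_j = cosine_basis(j, x)
        approx += np.mean(f * phi_j) * phi_j
    return approx

# Integrated squared error of the approximation for the three values of J.
errors = {J: np.mean((f - partial_sum(J)) ** 2) for J in (5, 20, 200)}
```

The integrated squared error shrinks as $J$ grows, in line with the panels of Figure 21.2.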


¹ The equality in the displayed equation means that $\int (f(x) - f_n(x))^2\,dx \to 0$ where $f_n(x) = \sum_{j=1}^n \beta_j\phi_j(x)$.

FIGURE 21.1. The first six functions in the cosine basis.

FIGURE 21.2. Approximating the doppler function with its expansion in the cosine basis. The function $f$ (top left) and its approximation $f_J(x) = \sum_{j=1}^J \beta_j\phi_j(x)$ with $J$ equal to 5 (top right), 20 (bottom left), and 200 (bottom right). The coefficients $\beta_j = \int_0^1 f(x)\phi_j(x)\,dx$ were computed numerically.

21.3 Example. The Legendre polynomials on $[-1, 1]$ are defined by

$$P_j(x) = \frac{1}{2^j j!}\frac{d^j}{dx^j}(x^2 - 1)^j, \qquad j = 0, 1, 2, \ldots \tag{21.8}$$

It can be shown that these functions are complete and orthogonal and that

$$\int_{-1}^1 P_j^2(x)\,dx = \frac{2}{2j + 1}. \tag{21.9}$$

It follows that the functions $\phi_j(x) = \sqrt{(2j + 1)/2}\,P_j(x)$, $j = 0, 1, \ldots$ form an orthonormal basis for $L_2(-1, 1)$. The first few Legendre polynomials are:

$$P_0(x) = 1, \qquad P_1(x) = x, \qquad P_2(x) = \frac{1}{2}\left(3x^2 - 1\right), \quad\text{and}\quad P_3(x) = \frac{1}{2}\left(5x^3 - 3x\right).$$

These polynomials may be constructed explicitly using the following recursive relation:

$$P_{j+1}(x) = \frac{(2j + 1)xP_j(x) - jP_{j-1}(x)}{j + 1}. \tag{21.10}$$
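The recursion (21.10) gives a convenient way to evaluate the polynomials. The sketch below builds $P_0, \ldots, P_4$ at a set of points and checks (21.9) with Gauss-Legendre quadrature from NumPy; the function name `legendre` and the choice of 10 quadrature nodes are illustrative.

```python
import numpy as np

def legendre(j_max, x):
    """Return [P_0(x), ..., P_{j_max}(x)] using the recursion (21.10)."""
    x = np.asarray(x, dtype=float)
    P = [np.ones_like(x), x.copy()]
    for j in range(1, j_max):
        P.append(((2 * j + 1) * x * P[j] - j * P[j - 1]) / (j + 1))
    return P[: j_max + 1]

# 10-point Gauss-Legendre quadrature is exact for polynomials of degree <= 19.
nodes, weights = np.polynomial.legendre.leggauss(10)
P = legendre(4, nodes)
norms = [np.sum(weights * P[j] ** 2) for j in range(5)]   # compare with 2 / (2j + 1)
```

Dividing each $P_j$ by the square root of its entry in `norms` gives the orthonormal functions $\phi_j$ described above.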


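The coefficients $\beta_1, \beta_2, \ldots$ are related to the smoothness of the function $f$. To see why, note that if $f$ is smooth, then its derivatives will be finite. Thus we expect that, for some $k$, $\int_0^1 (f^{(k)}(x))^2\,dx < \infty$ where $f^{(k)}$ is the $k$th derivative of $f$. Now consider the cosine basis (21.7) and let $f(x) = \sum_{j=0}^\infty \beta_j\phi_j(x)$. Then, differentiating term by term and using the orthonormality of the basis,

$$\int_0^1 \left(f^{(k)}(x)\right)^2 dx = \sum_{j=1}^\infty \beta_j^2(\pi j)^{2k}.$$

The only way that $\sum_{j=1}^\infty \beta_j^2(\pi j)^{2k}$ can be finite is if the $\beta_j$'s get small when $j$ gets large. To summarize:

If the function $f$ is smooth, then the coefficients $\beta_j$ will be small when $j$ is large.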


For  the  rest  of  this  chapter,  assume  we  are  using  the  cosine  basis  unless 
otherwise  specified. 


21.2  Density  Estimation 

Let $X_1, \ldots, X_n$ be IID observations from a distribution on $[0, 1]$ with density $f$. Assuming $f \in L_2$ we can write

$$f(x) = \sum_{j=1}^\infty \beta_j\phi_j(x)$$

where $\phi_1, \phi_2, \ldots$ is an orthonormal basis. Define

$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^n \phi_j(X_i). \tag{21.11}$$




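21.4 Theorem. The mean and variance of $\hat\beta_j$ are

$$E(\hat\beta_j) = \beta_j, \qquad V(\hat\beta_j) = \frac{\sigma_j^2}{n} \tag{21.12}$$

where

$$\sigma_j^2 = V(\phi_j(X_i)) = \int \left(\phi_j(x) - \beta_j\right)^2f(x)\,dx. \tag{21.13}$$

Proof. The mean is

$$E(\hat\beta_j) = \frac{1}{n}\sum_{i=1}^n E(\phi_j(X_i)) = E(\phi_j(X_1)) = \int \phi_j(x)f(x)\,dx = \beta_j.$$

The calculation for the variance is similar. ■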

Hence, $\hat\beta_j$ is an unbiased estimate of $\beta_j$. It is tempting to estimate $f$ by $\sum_{j=1}^\infty \hat\beta_j\phi_j(x)$ but this turns out to have a very high variance. Instead, consider the estimator

$$\hat f(x) = \sum_{j=1}^J \hat\beta_j\phi_j(x). \tag{21.14}$$

The number of terms $J$ is a smoothing parameter. Increasing $J$ will decrease bias while increasing variance. For technical reasons, we restrict $J$ to lie in the range

$$1 \le J \le p$$

where $p = p(n) = \sqrt{n}$. To emphasize the dependence of the risk function on $J$, we write the risk function as $R(J)$.

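21.5 Theorem. The risk of $\hat f$ is

$$R(J) = \sum_{j=1}^J \frac{\sigma_j^2}{n} + \sum_{j=J+1}^\infty \beta_j^2. \tag{21.15}$$

An estimate of the risk is

$$\hat R(J) = \sum_{j=1}^J \frac{\hat\sigma_j^2}{n} + \sum_{j=J+1}^p \left(\hat\beta_j^2 - \frac{\hat\sigma_j^2}{n}\right)_+ \tag{21.16}$$

where $a_+ = \max\{a, 0\}$ and

$$\hat\sigma_j^2 = \frac{1}{n-1}\sum_{i=1}^n \left(\phi_j(X_i) - \hat\beta_j\right)^2. \tag{21.17}$$

To motivate this estimator, note that $\hat\sigma_j^2$ is an unbiased estimate of $\sigma_j^2$ and $\hat\beta_j^2 - \hat\sigma_j^2/n$ is an unbiased estimator of $\beta_j^2$. We take the positive part of the latter term since we know that $\beta_j^2$ cannot be negative. We now choose $1 \le J \le p$ to minimize the estimated risk $\hat R(J)$. Here is a summary:

Summary of Orthogonal Function Density Estimation

1. Let

$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^n \phi_j(X_i).$$

2. Choose $\hat J$ to minimize $\hat R(J)$ over $1 \le J \le p = \sqrt{n}$ where $\hat R$ is given in equation (21.16).

3. Let

$$\hat f(x) = \sum_{j=1}^{\hat J} \hat\beta_j\phi_j(x).$$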

The estimator $\hat f_n$ can be negative. If we are interested in exploring the shape of $f$, this is not a problem. However, if we need our estimate to be a probability density function, we can truncate the estimate and then normalize it. That is, we take $\hat f^*(x) = \max\{\hat f_n(x), 0\}/\int_0^1 \max\{\hat f_n(u), 0\}\,du$.

Now let us construct a confidence band for $f$. Suppose we estimate $f$ using $J$ orthogonal functions. We are essentially estimating $f_J(x) = \sum_{j=1}^J \beta_j\phi_j(x)$, not the true density $f(x) = \sum_{j=1}^\infty \beta_j\phi_j(x)$. Thus, the confidence band should be regarded as a band for $f_J(x)$.
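$\epsilon_j \sim N(0, 1)$, and therefore

$$L \approx \frac{1}{n}\sum_{j=1}^J \sigma_j^2\epsilon_j^2 \le \frac{K^2}{n}\sum_{j=1}^J \epsilon_j^2 \stackrel{d}{=} \frac{K^2}{n}\chi_J^2.$$

Thus we have, approximately, that

$$P\left(L > \frac{K^2}{n}\chi_{J,\alpha}^2\right) \le P\left(\frac{K^2}{n}\chi_J^2 > \frac{K^2}{n}\chi_{J,\alpha}^2\right) = \alpha. \tag{21.20}$$

Also,

$$
\begin{aligned}
\max_x\left|\hat f_J(x) - f_J(x)\right| &\le \max_x\sum_{j=1}^J |\phi_j(x)|\,|\hat\beta_j - \beta_j| \\
&\le K\sum_{j=1}^J |\hat\beta_j - \beta_j| \\
&\le \sqrt{J}\,K\sqrt{\sum_{j=1}^J (\hat\beta_j - \beta_j)^2} \\
&= \sqrt{J}\,K\sqrt{L}
\end{aligned}
$$

where the third inequality is from the Cauchy-Schwarz inequality (Theorem 4.8). So,

$$
\begin{aligned}
P\left(\max_x\left|\hat f_J(x) - f_J(x)\right| > K^2\sqrt{\frac{J\chi_{J,\alpha}^2}{n}}\right)
&\le P\left(\sqrt{J}\,K\sqrt{L} > K^2\sqrt{\frac{J\chi_{J,\alpha}^2}{n}}\right) \\
&= P\left(\sqrt{L} > K\sqrt{\frac{\chi_{J,\alpha}^2}{n}}\right) \\
&= P\left(L > \frac{K^2\chi_{J,\alpha}^2}{n}\right) \le \alpha.
\end{aligned}
$$

21.7 Example. Let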


$$f(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10}\sum_{j=1}^5 \phi(x; \mu_j, .1)$$

where $\phi(x; \mu, \sigma)$ denotes a Normal density with mean $\mu$ and standard deviation $\sigma$, and $(\mu_1, \ldots, \mu_5) = (-1, -1/2, 0, 1/2, 1)$. Marron and Wand (1992) call this "the claw" although the "Bart Simpson" might be more appropriate. Figure 21.3 shows the true density as well as the estimated density based on $n = 5{,}000$ observations and a 95 percent confidence band. The density has been rescaled to have most of its mass between 0 and 1 using the transformation $y = (x + 3)/6$. ■

FIGURE 21.3. The top plot is the true density for the Bart Simpson distribution (rescaled to have most of its mass between 0 and 1). The bottom plot is the orthogonal function density estimate and 95 percent confidence band.


21.3  Regression 


Consider the regression model

$$Y_i = r(x_i) + \epsilon_i, \qquad i = 1, \ldots, n \tag{21.21}$$

where the $\epsilon_i$ are independent with mean 0 and variance $\sigma^2$. We will initially focus on the special case where $x_i = i/n$. We assume that $r \in L_2(0, 1)$ and hence we can write

$$r(x) = \sum_{j=1}^\infty \beta_j\phi_j(x) \quad\text{where}\quad \beta_j = \int_0^1 r(x)\phi_j(x)\,dx \tag{21.22}$$

where $\phi_1, \phi_2, \ldots$ is an orthonormal basis for $[0, 1]$.
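Define

$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^n Y_i\phi_j(x_i), \qquad j = 1, 2, \ldots \tag{21.23}$$

Since $\hat\beta_j$ is an average, the central limit theorem tells us that $\hat\beta_j$ will be approximately Normally distributed.

21.8 Theorem.

$$\hat\beta_j \approx N\left(\beta_j, \frac{\sigma^2}{n}\right). \tag{21.24}$$

Proof. The mean of $\hat\beta_j$ is

$$E(\hat\beta_j) = \frac{1}{n}\sum_{i=1}^n E(Y_i)\phi_j(x_i) = \frac{1}{n}\sum_{i=1}^n r(x_i)\phi_j(x_i) \approx \int r(x)\phi_j(x)\,dx = \beta_j$$

where the approximate equality follows from the definition of a Riemann integral: $\sum_i \Delta_nh(x_i) \to \int_0^1 h(x)\,dx$ where $\Delta_n = 1/n$. The variance is

$$
\begin{aligned}
V(\hat\beta_j) &= \frac{1}{n^2}\sum_{i=1}^n V(Y_i)\phi_j^2(x_i) = \frac{\sigma^2}{n^2}\sum_{i=1}^n \phi_j^2(x_i) \\
&= \frac{\sigma^2}{n}\,\frac{1}{n}\sum_{i=1}^n \phi_j^2(x_i) \approx \frac{\sigma^2}{n}\int \phi_j^2(x)\,dx = \frac{\sigma^2}{n}
\end{aligned}
$$

since $\int \phi_j^2(x)\,dx = 1$.

Let

$$\hat r(x) = \sum_{j=1}^J \hat\beta_j\phi_j(x)$$

and let

$$R(J) = E\int \left(r(x) - \hat r(x)\right)^2 dx$$

be the risk of the estimator.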


21.9 Theorem. The risk $R(J)$ of the estimator $\hat r_n(x) = \sum_{j=1}^J \hat\beta_j\phi_j(x)$ is

$$R(J) = \frac{J\sigma^2}{n} + \sum_{j=J+1}^\infty \beta_j^2. \tag{21.25}$$
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To  estimate  for  a2  =  V(e^)  we  use 


i  £  3 

i=n-k-\- 1 


(21.26) 


where  k  =  n/4.  To  motivate  this  estimator,  recall  that  if  /  is  smooth,  then 
/3j  «  0  for  large  j.  So,  for  j  >  fc,  /3j  «  iV(0,  cr2/n)  and  thus,  fa  aZj  j \fn  for 
for  j  >  kj  where  Zj  ^  iV(0, 1).  Therefore, 


91  =  i  £ 

i=n— fc+1  i=n— fc+1 

2  n  2 

a  Z2  _  ^  2 

Z_^  @3  £ 

i=n— k-\-l 

since  a  sum  of  k  Normals  has  a  xt  distribution.  Now  E(y|)  =  k  and  hence 
E(cr2)  «  cr2.  Also,  V(x^)  =  2k  and  hence  V(cr2)  «  (cr4//c2)(2/c)  =  (2<74//c)  — >►  0 
as  n  — )►  oo.  Thus  we  expect  a2  to  be  a  consistent  estimator  of  a2.  There  is 
nothing  special  about  the  choice  k  =  n/ 4.  Any  k  that  increases  with  n  at  an 
appropriate  rate  will  suffice. 

We estimate the risk with

$$\hat R(J) = \frac{J\hat\sigma^2}{n} + \sum_{j=J+1}^{n}\left(\hat\beta_j^2 - \frac{\hat\sigma^2}{n}\right)_+ \tag{21.27}$$


21.10 Example. Figure 21.4 shows the doppler function $f$ and $n = 2{,}048$
observations generated from the model

$$Y_i = r(x_i) + \epsilon_i$$

where $x_i = i/n$, $\epsilon_i \sim N(0, (.1)^2)$. The figure shows the data and the estimated
function. The estimate was based on $J = 234$ terms. ■


We  are  now  ready  to  give  a  complete  description  of  the  method. 


Orthogonal Series Regression Estimator

1. Let

$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^n Y_i\,\phi_j(x_i), \qquad j = 1, \ldots, n.$$

2. Let

$$\hat\sigma^2 = \frac{n}{k}\sum_{i=n-k+1}^{n} \hat\beta_i^2 \tag{21.28}$$

where $k \approx n/4$.

3. For $1 \le J \le n$, compute the risk estimate

$$\hat R(J) = \frac{J\hat\sigma^2}{n} + \sum_{j=J+1}^{n}\left(\hat\beta_j^2 - \frac{\hat\sigma^2}{n}\right)_+.$$

4. Choose $\hat J \in \{1, \ldots, n\}$ to minimize $\hat R(J)$.

5. Let

$$\hat r(x) = \sum_{j=1}^{\hat J} \hat\beta_j\,\phi_j(x).$$

FIGURE 21.4. Data from the doppler test function and the estimated function. See
Example 21.10.
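The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's implementation: it assumes the cosine basis indexed as $\phi_1(x) = 1$, $\phi_j(x) = \sqrt{2}\cos((j-1)\pi x)$, and the names `cos_basis` and `fit_series`, the test function $\sin(2\pi x)$, and the noise level are all chosen here for illustration.

```python
import math
import random

def cos_basis(j, x):
    """Assumed cosine basis on [0, 1]: phi_1 = 1, phi_j = sqrt(2) cos((j-1) pi x)."""
    return 1.0 if j == 1 else math.sqrt(2.0) * math.cos((j - 1) * math.pi * x)

def fit_series(x, y):
    n = len(y)
    # Step 1: beta_hat_j = (1/n) sum_i Y_i phi_j(x_i), j = 1..n
    beta = [sum(yi * cos_basis(j, xi) for xi, yi in zip(x, y)) / n
            for j in range(1, n + 1)]
    # Step 2: sigma^2 estimate from the last k coefficients, k about n/4
    k = max(1, n // 4)
    sigma2 = (n / k) * sum(b * b for b in beta[n - k:])
    # Step 3: tail[J] = sum_{j=J+1}^n (beta_j^2 - sigma2/n)_+ via a running sum
    tail = [0.0] * (n + 1)
    for j in range(n, 0, -1):
        tail[j - 1] = tail[j] + max(beta[j - 1] ** 2 - sigma2 / n, 0.0)
    risk = {J: J * sigma2 / n + tail[J] for J in range(1, n + 1)}
    # Step 4: minimize the estimated risk
    J_hat = min(risk, key=risk.get)
    # Step 5: the fitted function uses the first J_hat terms
    def r_hat(t):
        return sum(beta[j - 1] * cos_basis(j, t) for j in range(1, J_hat + 1))
    return r_hat, J_hat, sigma2

random.seed(0)
n = 256
x = [i / n for i in range(1, n + 1)]
truth = [math.sin(2 * math.pi * xi) for xi in x]
y = [t + random.gauss(0.0, 0.1) for t in truth]
r_hat, J_hat, sigma2_hat = fit_series(x, y)
mse = sum((r_hat(xi) - t) ** 2 for xi, t in zip(x, truth)) / n
print(J_hat, round(sigma2_hat, 4), round(mse, 5))
```

The direct sums cost $O(n^2)$ basis evaluations, which is fine for a few hundred points; a fast cosine transform would be the natural replacement for larger $n$.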


Finally, we turn to confidence bands. As before, these bands are not really
for the true function $r(x)$ but rather for the smoothed version of the function
$r_J(x) = \sum_{j=1}^{J} \beta_j\phi_j(x)$.


21.11 Theorem. Suppose the estimate $\hat r$ is based on $J$ terms and $\hat\sigma$ is defined
as in equation (21.28). Assume that $J < n - k + 1$. An approximate $1 - \alpha$
confidence band for $r_J$ is $(\ell, u)$ where

$$\ell(x) = \hat r_n(x) - c, \qquad u(x) = \hat r_n(x) + c, \tag{21.29}$$

where

$$c = \frac{a(x)\,\hat\sigma\,\chi_{J,\alpha}}{\sqrt{n}}, \qquad a(x) = \sqrt{\sum_{j=1}^{J}\phi_j^2(x)},$$

and $\hat\sigma$ is given in equation (21.28).


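Proof. Let $L = \sum_{j=1}^{J}(\hat\beta_j - \beta_j)^2$. By the central limit theorem, $\hat\beta_j \approx N(\beta_j, \sigma^2/n)$.
Hence, $\hat\beta_j \approx \beta_j + \sigma\epsilon_j/\sqrt{n}$ where $\epsilon_j \sim N(0,1)$ and therefore

$$L \approx \frac{\sigma^2}{n}\sum_{j=1}^{J}\epsilon_j^2 \overset{d}{=} \frac{\sigma^2}{n}\chi_J^2.$$

Thus,

$$\mathbb{P}\left(L > \frac{\sigma^2}{n}\chi_{J,\alpha}^2\right) \approx \mathbb{P}\left(\frac{\sigma^2}{n}\chi_J^2 > \frac{\sigma^2}{n}\chi_{J,\alpha}^2\right) = \alpha.$$

Also,

$$|\hat r_n(x) - r_J(x)| \le \sum_{j=1}^{J}|\phi_j(x)|\,|\hat\beta_j - \beta_j| \le \sqrt{\sum_{j=1}^{J}\phi_j^2(x)}\,\sqrt{\sum_{j=1}^{J}(\hat\beta_j - \beta_j)^2} = a(x)\sqrt{L}$$

by the Cauchy-Schwarz inequality (Theorem 4.8). So,

$$\mathbb{P}\left(\max_x \frac{|\hat r_n(x) - r_J(x)|}{a(x)} > \frac{\sigma\chi_{J,\alpha}}{\sqrt{n}}\right) \le \mathbb{P}\left(\sqrt{L} > \frac{\sigma\chi_{J,\alpha}}{\sqrt{n}}\right) = \alpha$$

and the result follows. ■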

21.12 Example. Figure 21.5 shows the confidence envelope for the doppler
signal. The first plot is based on $J = 234$ (the value of $J$ that minimizes the
estimated risk). The second is based on $J = 45 \approx \sqrt{n}$. Larger $J$ yields a higher
resolution estimator at the cost of large confidence bands. Smaller $J$ yields a
lower resolution estimator but has tighter confidence bands. ■

So far, we have assumed that the $x_i$'s are of the form $\{1/n, 2/n, \ldots, 1\}$.
If the $x_i$'s are on an interval $[a, b]$, then we can rescale them so that they are in the
interval $[0, 1]$. If the $x_i$'s are not equally spaced, the methods we have discussed
still apply so long as the $x_i$'s "fill out" the interval $[0,1]$ in such a way so as to
not be too clumped together. If we want to treat the $x_i$'s as random instead
of fixed, then the method needs significant modifications which we shall not
deal with here.

FIGURE 21.5. Estimates and confidence bands for the doppler test function using
$n = 2{,}048$ observations. First plot: $J = 234$ terms. Second plot: $J = 45$ terms.


21.4  Wavelets 

Suppose there is a sharp jump in a regression function $f$ at some point $x$
but that $f$ is otherwise very smooth. Such a function $f$ is said to be spatially
inhomogeneous. The doppler function is an example of a spatially
inhomogeneous function; it is smooth for large $x$ and unsmooth for small $x$.

It is hard to estimate $f$ using the methods we have discussed so far. If we
use  a  cosine  basis  and  only  keep  low  order  terms,  we  will  miss  the  peak;  if 
we  allow  higher  order  terms  we  will  find  the  peak  but  we  will  make  the  rest 
of  the  curve  very  wiggly.  Similar  comments  apply  to  kernel  regression.  If  we 
use  a  large  bandwidth,  then  we  will  smooth  out  the  peak;  if  we  use  a  small 
bandwidth,  then  we  will  find  the  peak  but  we  will  make  the  rest  of  the  curve 
very  wiggly. 

One  way  to  estimate  inhomogeneous  functions  is  to  use  a  more  carefully 
chosen  basis  that  allows  us  to  place  a  “blip”  in  some  small  region  without 
adding  wiggles  elsewhere.  In  this  section,  we  describe  a  special  class  of  bases 
called  wavelets,  that  are  aimed  at  fixing  this  problem.  Statistical  inference 
using  wavelets  is  a  large  and  active  area.  We  will  just  discuss  a  few  of  the 
main  ideas  to  get  a  flavor  of  this  approach. 

We start with a particular wavelet called the Haar wavelet. The Haar
father wavelet or Haar scaling function is defined by

$$\phi(x) = \begin{cases} 1 & \text{if } 0 \le x < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{21.30}$$

The mother Haar wavelet is defined by

$$\psi(x) = \begin{cases} -1 & \text{if } 0 \le x \le \frac{1}{2} \\ 1 & \text{if } \frac{1}{2} < x \le 1. \end{cases} \tag{21.31}$$

For any integers $j$ and $k$ define

$$\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k). \tag{21.32}$$

The function $\psi_{j,k}$ has the same shape as $\psi$ but it has been rescaled by a factor
of $2^{j/2}$ and shifted by a factor of $k$.

See Figure 21.6 for some examples of Haar wavelets. Notice that for large
$j$, $\psi_{j,k}$ is a very localized function. This makes it possible to add a blip to a
function in one place without adding wiggles elsewhere. Increasing $j$ is like
looking in a microscope at increasing degrees of resolution. In technical terms,
we say that wavelets provide a multiresolution analysis of $L_2(0, 1)$.

FIGURE 21.6. Some Haar wavelets. Left: the mother wavelet $\psi(x)$; Right: $\psi_{2,2}(x)$.

Let

$$W_j = \{\psi_{j,k},\ k = 0, 1, \ldots, 2^j - 1\}$$

be the set of rescaled and shifted mother wavelets at resolution $j$.

21.13 Theorem. The set of functions

$$\{\phi, W_0, W_1, W_2, \ldots\}$$

is an orthonormal basis for $L_2(0, 1)$.

It follows from this theorem that we can expand any function $f \in L_2(0, 1)$ in
this basis. Because each $W_j$ is itself a set of functions, we write the expansion
as a double sum:

$$f(x) = \alpha\,\phi(x) + \sum_{j=0}^{\infty}\sum_{k=0}^{2^j-1}\beta_{j,k}\,\psi_{j,k}(x) \tag{21.33}$$

where $\alpha = \int_0^1 f(x)\phi(x)\,dx$ and $\beta_{j,k} = \int_0^1 f(x)\psi_{j,k}(x)\,dx$.
We call $\alpha$ the scaling coefficient and the $\beta_{j,k}$'s are called the detail
coefficients. We call the finite sum

$$f_J(x) = \alpha\,\phi(x) + \sum_{j=0}^{J-1}\sum_{k=0}^{2^j-1}\beta_{j,k}\,\psi_{j,k}(x) \tag{21.34}$$

the resolution $J$ approximation to $f$. The total number of terms in this sum
is

$$1 + \sum_{j=0}^{J-1} 2^j = 1 + 2^J - 1 = 2^J.$$
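A quick numerical check can make the orthonormality claim and the term count concrete. The sketch below assumes the Haar definitions (21.30) to (21.32), with $\psi$ taken to be zero outside $[0,1]$, and approximates inner products by a midpoint rule; the helper names `inner` and `basis` are illustrative only.

```python
import math

def phi(x):
    """Haar father wavelet (21.30)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def psi(x):
    """Haar mother wavelet (21.31), taken to be 0 outside [0, 1]."""
    if 0.0 <= x <= 0.5:
        return -1.0
    if 0.5 < x <= 1.0:
        return 1.0
    return 0.0

def psi_jk(j, k, x):
    """Rescaled and shifted wavelet (21.32)."""
    return 2.0 ** (j / 2.0) * psi(2.0 ** j * x - k)

def inner(f, g, N=1024):
    """Midpoint-rule approximation to the integral of f*g over [0, 1]."""
    return sum(f((i + 0.5) / N) * g((i + 0.5) / N) for i in range(N)) / N

# phi together with W_0, W_1, W_2: the 2^3 = 8 terms of a resolution-3 expansion
basis = [phi] + [
    (lambda x, j=j, k=k: psi_jk(j, k, x))
    for j in range(3) for k in range(2 ** j)
]
# Gram matrix of pairwise inner products; it should be close to the identity
gram = [[inner(f, g) for g in basis] for f in basis]
print(len(basis), [round(gram[i][i], 6) for i in range(len(basis))])
```

The midpoint grid avoids the dyadic breakpoints of the wavelets, so the piecewise-constant integrands are integrated essentially exactly.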

21.14 Example. Figure 21.7 shows the doppler signal, and its reconstruction
using $J = 3$, $J = 5$, and $J = 8$. ■

Haar wavelets are localized, meaning that they are zero outside an interval.
But they are not smooth. This raises the question of whether there exist
smooth, localized wavelets that form an orthonormal basis. In 1988, Ingrid
Daubechies showed that such wavelets do exist. These smooth wavelets are
difficult to describe. They can be constructed numerically but there is no
closed-form formula for the smoother wavelets. To keep things simple, we will
continue to use Haar wavelets.

Consider the regression model $Y_i = r(x_i) + \sigma\epsilon_i$ where $\epsilon_i \sim N(0, 1)$ and
$x_i = i/n$. To simplify the discussion we assume that $n = 2^J$ for some $J$.

There is one major difference between estimation using wavelets and estimation using
a cosine (or polynomial) basis. With the cosine basis, we used all the terms
$1 \le j \le J$ for some $J$. The number of terms $J$ acted as a smoothing parameter.
With wavelets, we control smoothing using a method called thresholding
where we keep a term in the function approximation if its coefficient is large;
otherwise, we throw out that term. There are many versions of thresholding.
The simplest is called hard, universal thresholding. Let $J = \log_2(n)$ and define

$$\hat\alpha = \frac{1}{n}\sum_i \phi(x_i)\,Y_i \quad\text{and}\quad D_{j,k} = \frac{1}{n}\sum_i \psi_{j,k}(x_i)\,Y_i \tag{21.35}$$

for $0 \le j \le J - 1$.

FIGURE 21.7. The doppler signal and its reconstruction
$f_J(x) = \alpha\phi(x) + \sum_{j=0}^{J-1}\sum_k \beta_{j,k}\psi_{j,k}(x)$ based on $J = 3$, $J = 5$, and $J = 8$.


Haar Wavelet Regression

1. Compute $\hat\alpha$ and $D_{j,k}$ as in (21.35), for $0 \le j \le J - 1$.

2. Estimate $\sigma$; see (21.37).

3. Apply universal thresholding:

$$\hat\beta_{j,k} = \begin{cases} D_{j,k} & \text{if } |D_{j,k}| > \hat\sigma\sqrt{\dfrac{2\log n}{n}} \\ 0 & \text{otherwise.} \end{cases} \tag{21.36}$$

4. Set $\hat f(x) = \hat\alpha\,\phi(x) + \sum_{j=0}^{J-1}\sum_{k=0}^{2^j-1}\hat\beta_{j,k}\,\psi_{j,k}(x)$.

In practice, we do not compute $\hat\alpha$ and $D_{j,k}$ using (21.35). Instead, we use
the discrete wavelet transform (DWT) which is very fast. The DWT for
Haar wavelets is described in the appendix. The estimate of $\sigma$ is

$$\hat\sigma = \sqrt{n} \times \frac{\mathrm{median}\left(|D_{J-1,k}| : k = 0, \ldots, 2^{J-1} - 1\right)}{0.6745}. \tag{21.37}$$

The estimate for $\sigma$ may look strange. It is similar to the estimate we used
for the cosine basis but it is designed to be insensitive to sharp peaks in the
function.
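Steps 2 and 3 of the procedure can be illustrated on their own. The sketch below assumes the detail coefficients are already available as a list of lists `D`, with `D[j]` holding $D_{j,k}$ for $k = 0, \ldots, 2^j - 1$; that layout, the function names, and the simulated coefficients are assumptions made for this example.

```python
import math
import random
import statistics

def estimate_sigma(D_finest, n):
    """Equation (21.37): sqrt(n) * median(|D_{J-1,k}|) / 0.6745."""
    return math.sqrt(n) * statistics.median(abs(d) for d in D_finest) / 0.6745

def universal_threshold(D, n):
    """Hard universal thresholding (21.36) applied level by level."""
    sigma = estimate_sigma(D[-1], n)          # finest level is D[J-1]
    lam = sigma * math.sqrt(2.0 * math.log(n) / n)
    beta = [[d if abs(d) > lam else 0.0 for d in level] for level in D]
    return beta, sigma, lam

# Simulated coefficients: pure noise with variance sigma^2/n plus one real signal term
random.seed(0)
n, sigma, J = 1024, 0.5, 10
D = [[random.gauss(0.0, sigma / math.sqrt(n)) for _ in range(2 ** j)]
     for j in range(J)]
D[3][2] = 1.0
beta, sigma_hat, lam = universal_threshold(D, n)
kept = sum(1 for level in beta for b in level if b != 0.0)
print(round(sigma_hat, 3), round(lam, 4), kept)
```

Because the median ignores a handful of very large coefficients, the planted signal term barely affects `sigma_hat`, which is the robustness the text alludes to.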

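To understand the intuition behind universal thresholding, consider what
happens when there is no signal, that is, when $\beta_{j,k} = 0$ for all $j$ and $k$.

21.15 Theorem. Suppose that $\beta_{j,k} = 0$ for all $j$ and $k$ and let $\hat\beta_{j,k}$ be the
universal threshold estimator. Then

$$\mathbb{P}(\hat\beta_{j,k} = 0 \text{ for all } j, k) \to 1$$

as $n \to \infty$.

Proof. To simplify the proof, assume that $\sigma$ is known. Now $D_{j,k} \approx N(0, \sigma^2/n)$.
We will need Mill's inequality (Theorem 4.7): if $Z \sim N(0, 1)$
then $\mathbb{P}(|Z| > t) \le (c/t)e^{-t^2/2}$ where $c = \sqrt{2/\pi}$ is a constant. Thus, with
$\lambda = \sigma\sqrt{2\log n/n}$ denoting the threshold,

$$\mathbb{P}(\max |D_{j,k}| > \lambda) \le \sum_{j,k}\mathbb{P}(|D_{j,k}| > \lambda) = \sum_{j,k}\mathbb{P}\left(\frac{\sqrt{n}\,|D_{j,k}|}{\sigma} > \frac{\sqrt{n}\,\lambda}{\sigma}\right) \le \sum_{j,k}\frac{c\,\sigma}{\lambda\sqrt{n}}\exp\left\{-\frac{n\lambda^2}{2\sigma^2}\right\} = \frac{c}{\sqrt{2\log n}} \to 0. \ ■$$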


21.16 Example. Consider $Y_i = r(x_i) + \sigma\epsilon_i$ where $f$ is the doppler signal,
$\sigma = .1$ and $n = 2{,}048$. Figure 21.8 shows the data and the estimated function
using universal thresholding. Of course, the estimate is not smooth since Haar
wavelets are not smooth. Nonetheless, the estimate is quite accurate. ■

FIGURE 21.8. Estimate of the Doppler function using Haar wavelets and universal
thresholding.

21.5  Appendix 

The DWT for Haar Wavelets. Let $y$ be the vector of $Y_i$'s (length $n$) and
let $J = \log_2(n)$. Create a list $D$ with elements

    D[[0]], ..., D[[J-1]].

Set:

    temp <- y / sqrt(n)

Then do:

    for (j in (J-1):0) {
        m <- 2^j
        I <- (1:m)
        D[[j]] <- (temp[2*I] - temp[(2*I) - 1]) / sqrt(2)
        temp   <- (temp[2*I] + temp[(2*I) - 1]) / sqrt(2)
    }
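The pseudocode above translates directly into Python. The following is a sketch rather than a reference implementation: the function names `haar_dwt` and `haar_idwt` are chosen here, indices are 0-based, and the inverse transform is an addition not described in the text, included only to check that nothing is lost.

```python
import math

def haar_dwt(y):
    """Haar DWT of a list y whose length n is a power of 2.

    Returns (alpha, D) where D[j] is the list of detail coefficients at
    resolution j, for j = 0, ..., J - 1 with J = log2(n).
    """
    n = len(y)
    J = n.bit_length() - 1
    if n < 1 or 2 ** J != n:
        raise ValueError("length of y must be a power of 2")
    temp = [v / math.sqrt(n) for v in y]
    D = [None] * J
    for j in range(J - 1, -1, -1):
        m = 2 ** j
        # 0-based pairs (2i, 2i+1) play the role of the 1-based (2I-1, 2I)
        D[j] = [(temp[2 * i + 1] - temp[2 * i]) / math.sqrt(2.0) for i in range(m)]
        temp = [(temp[2 * i + 1] + temp[2 * i]) / math.sqrt(2.0) for i in range(m)]
    return temp[0], D

def haar_idwt(alpha, D):
    """Invert haar_dwt by undoing the pairwise sums and differences."""
    temp = [alpha]
    for level in D:
        nxt = []
        for s, d in zip(temp, level):
            nxt.append((s - d) / math.sqrt(2.0))  # first element of the pair
            nxt.append((s + d) / math.sqrt(2.0))  # second element of the pair
        temp = nxt
    n = len(temp)
    return [v * math.sqrt(n) for v in temp]

y = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
alpha, D = haar_dwt(y)
y_back = haar_idwt(alpha, D)
print(alpha, [len(level) for level in D])
```

With this normalization the last remaining value of `temp` is the sample mean of `y`, which plays the role of $\hat\alpha$, and the whole transform takes $O(n)$ operations.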




21.6  Bibliographic  Remarks 

Efromovich (1999) is a reference for orthogonal function methods. See also
Beran (2000) and Beran and Dümbgen (1998). An introduction to wavelets is
given in Ogden (1997). A more advanced treatment can be found in Härdle
et al. (1998). The theory of statistical estimation using wavelets has been
developed by many authors, especially David Donoho and Ian Johnstone. See
Donoho and Johnstone (1994), Donoho and Johnstone (1995), Donoho et al.
(1995), and Donoho and Johnstone (1998).


21.7  Exercises 


1.  Prove  Theorem  21.5. 

2.  Prove  Theorem  21.9. 

3.  Let 


11  2  \ 


Show  that  these  vectors  have  norm  1  and  are  orthogonal. 

4.  Prove  Parseval’s  relation  equation  (21.6). 

5.  Plot  the  first  five  Legendre  polynomials.  Verify,  numerically,  that  they 
are  orthonormal. 


6. Expand the following functions in the cosine basis on $[0,1]$. For (a)
and (b), find the coefficients $\beta_j$ analytically. For (c) and (d), find the
coefficients $\beta_j$ numerically, i.e.

$$\beta_j = \int_0^1 f(x)\,\phi_j(x)\,dx \approx \frac{1}{N}\sum_{r=1}^{N} f\left(\frac{r}{N}\right)\phi_j\left(\frac{r}{N}\right)$$

for some large integer $N$. Then plot the partial sum $\sum_{j=1}^{n}\beta_j\phi_j(x)$ for
increasing values of $n$.

(a) $f(x) = \sqrt{2}\cos(3\pi x)$.

(b) $f(x) = \sin(\pi x)$.

(c) $f(x) = \sum_{j=1}^{11} h_j K(x - t_j)$ where $K(t) = (1 + \mathrm{sign}(t))/2$, $\mathrm{sign}(x) = -1$
if $x < 0$, $\mathrm{sign}(x) = 0$ if $x = 0$, $\mathrm{sign}(x) = 1$ if $x > 0$,

$(t_j) = (.1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81)$,

$(h_j) = (4, -5, 3, -4, 5, -4.2, 2.1, 4.3, -3.1, 2.1, -4.2)$.

(d) $f(x) = \sqrt{x(1 - x)}\,\sin\left(\dfrac{2.1\pi}{x + .05}\right)$.

7. Consider the glass fragments data from the book's website. Let $Y$ be
refractive index and let $X$ be aluminum content (the fourth variable).

(a) Do a nonparametric regression to fit the model $Y = f(x) + \epsilon$ using
the cosine basis method. The data are not on a regular grid. Ignore this
when estimating the function. (But do sort the data first according to
$x$.) Provide a function estimate, an estimate of the risk, and a confidence
band.

(b) Use the wavelet method to estimate $f$.

8.  Show  that  the  Haar  wavelets  are  orthonormal. 

9. Consider again the doppler signal:

$$f(x) = \sqrt{x(1 - x)}\,\sin\left(\frac{2.1\pi}{x + 0.05}\right).$$

Let $n = 1{,}024$, $\sigma = 0.1$, and let $(x_1, \ldots, x_n) = (1/n, \ldots, 1)$. Generate
data

$$Y_i = f(x_i) + \sigma\epsilon_i$$

where $\epsilon_i \sim N(0, 1)$.

(a) Fit the curve using the cosine basis method. Plot the function estimate
and confidence band for $J = 10, 20, \ldots, 100$.

(b) Use Haar wavelets to fit the curve.

10. (Haar density Estimation.) Let $X_1, \ldots, X_n \sim f$ for some density $f$ on
$[0, 1]$. Let's consider constructing a wavelet histogram. Let $\phi$ and $\psi$ be
the Haar father and mother wavelet. Write

$$f(x) \approx \phi(x) + \sum_{j=0}^{J-1}\sum_{k=0}^{2^j-1}\beta_{j,k}\,\psi_{j,k}(x)$$

where $J \approx \log_2(n)$. Let

$$\hat\beta_{j,k} = \frac{1}{n}\sum_{i=1}^{n}\psi_{j,k}(X_i).$$

(a) Show that $\hat\beta_{j,k}$ is an unbiased estimate of $\beta_{j,k}$.

(b) Define the Haar histogram

$$\hat f(x) = \phi(x) + \sum_{j=0}^{B}\sum_{k=0}^{2^j-1}\hat\beta_{j,k}\,\psi_{j,k}(x)$$

for $0 \le B \le J - 1$.

(c) Find an approximate expression for the MSE as a function of $B$.

(d) Generate $n = 1{,}000$ observations from a Beta(15,4) density. Estimate
the density using the Haar histogram. Use leave-one-out cross
validation to choose $B$.


11. In this question, we will explore the motivation for equation (21.37). Let
$X_1, \ldots, X_n \sim N(0, \sigma^2)$. Let

$$\hat\sigma = \sqrt{n} \times \frac{\mathrm{median}\left(|X_1|, \ldots, |X_n|\right)}{0.6745}.$$

(a) Show that $\mathbb{E}(\hat\sigma) = \sigma$.

(b) Simulate $n = 100$ observations from a N(0,1) distribution. Compute
$\hat\sigma$ as well as the usual estimate of $\sigma$. Repeat 1,000 times and compare
the MSE.

(c) Repeat (b) but add some outliers to the data. To do this, simulate
each observation from a N(0,1) with probability .95 and simulate each
observation from a N(0,10) with probability .05.

12.  Repeat  question  6  using  the  Haar  basis. 


22 

Classification 


22.1  Introduction 


The problem of predicting a discrete random variable $Y$ from another random
variable $X$ is called classification, supervised learning, discrimination,
or pattern recognition.

Consider IID data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where

$$X_i = (X_{i1}, \ldots, X_{id}) \in \mathcal{X} \subset \mathbb{R}^d$$

is a $d$-dimensional vector and $Y_i$ takes values in some finite set $\mathcal{Y}$. A classification
rule is a function $h : \mathcal{X} \to \mathcal{Y}$. When we observe a new $X$, we predict
$Y$ to be $h(X)$.


22.1 Example. Here is an example with fake data. Figure 22.1 shows 100
data points. The covariate $X = (X_1, X_2)$ is 2-dimensional and the outcome
$Y \in \mathcal{Y} = \{0, 1\}$. The $Y$ values are indicated on the plot with the triangles
representing $Y = 1$ and the squares representing $Y = 0$. Also shown is a linear
classification rule represented by the solid line. This is a rule of the form

$$h(x) = \begin{cases} 1 & \text{if } a + b_1 x_1 + b_2 x_2 > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Everything above the line is classified as a 0 and everything below the line is
classified as a 1. ■

FIGURE 22.1. Two covariates ($x_1$ on the horizontal axis, $x_2$ on the vertical axis) and a linear decision boundary. △ means $Y = 1$.
□ means $Y = 0$. These two groups are perfectly separated by the linear decision
boundary; you probably won't see real data like this.


22.2 Example. Recall the Coronary Risk-Factor Study (CORIS) data
from Example 13.17. There are 462 males between the ages of 15 and 64 from
three rural areas in South Africa. The outcome $Y$ is the presence ($Y = 1$) or
absence ($Y = 0$) of coronary heart disease and there are 9 covariates: systolic
blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol),
adiposity, famhist (family history of heart disease), typea (type-A behavior),
obesity, alcohol (current alcohol consumption), and age. I computed
a linear decision boundary using the LDA method based on two of the
covariates, systolic blood pressure and tobacco consumption. The LDA method
will be explained shortly. In this example, the groups are hard to tell apart.
In fact, 141 of the 462 subjects are misclassified using this classification rule.


At this point, it is worth revisiting the Statistics/Data Mining dictionary:

Statistics        Computer Science       Meaning
--------------    -------------------    ------------------------------------
classification    supervised learning    predicting a discrete $Y$ from $X$
data              training sample        $(X_1, Y_1), \ldots, (X_n, Y_n)$
covariates        features               the $X_i$'s
classifier        hypothesis             map $h : \mathcal{X} \to \mathcal{Y}$
estimation        learning               finding a good classifier


22.2  Error  Rates  and  the  Bayes  Classifier 


Our  goal  is  to  find  a  classification  rule  h  that  makes  accurate  predictions.  We 
start  with  the  following  definitions: 


22.3 Definition. The true error rate$^1$ of a classifier $h$ is

$$L(h) = \mathbb{P}(\{h(X) \ne Y\}) \tag{22.1}$$

and the empirical error rate or training error rate is

$$\hat L_n(h) = \frac{1}{n}\sum_{i=1}^{n} I(h(X_i) \ne Y_i). \tag{22.2}$$
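As a small illustration of (22.2), the sketch below computes the training error rate of a linear rule of the kind used in Example 22.1. The data points, the coefficients, and the helper names `linear_rule` and `training_error` are made up for this example.

```python
def linear_rule(a, b1, b2):
    """Classifier h(x) = 1 if a + b1*x1 + b2*x2 > 0, else 0."""
    def h(x):
        return 1 if a + b1 * x[0] + b2 * x[1] > 0 else 0
    return h

def training_error(h, X, Y):
    """Empirical error rate (22.2): fraction of points with h(X_i) != Y_i."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(Y)

# Five hypothetical training points with two covariates each
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
Y = [0, 0, 0, 1, 0]
h = linear_rule(-2.5, 1.0, 1.0)   # predicts 1 when x1 + x2 > 2.5
err = training_error(h, X, Y)
print(err)
```

Only the last point is misclassified here, so the training error is one in five. The true error rate $L(h)$ cannot be computed this way because it depends on the unknown distribution of $(X, Y)$.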


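First we consider the special case where $\mathcal{Y} = \{0, 1\}$. Let

$$r(x) = \mathbb{E}(Y \mid X = x) = \mathbb{P}(Y = 1 \mid X = x)$$

denote the regression function. From Bayes' theorem we have that

$$r(x) = \mathbb{P}(Y = 1 \mid X = x) = \frac{f(x \mid Y = 1)\,\mathbb{P}(Y = 1)}{f(x \mid Y = 1)\,\mathbb{P}(Y = 1) + f(x \mid Y = 0)\,\mathbb{P}(Y = 0)} = \frac{\pi f_1(x)}{\pi f_1(x) + (1 - \pi) f_0(x)} \tag{22.3}$$

where

$$f_0(x) = f(x \mid Y = 0), \qquad f_1(x) = f(x \mid Y = 1), \qquad \pi = \mathbb{P}(Y = 1).$$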

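22.4 Definition. The Bayes classification rule $h^*$ is

$$h^*(x) = \begin{cases} 1 & \text{if } r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \tag{22.4}$$

The set $\mathcal{D}(h) = \{x : \mathbb{P}(Y = 1 \mid X = x) = \mathbb{P}(Y = 0 \mid X = x)\}$ is called the
decision boundary.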


Warning! The Bayes rule has nothing to do with Bayesian inference. We
could estimate the Bayes rule using either frequentist or Bayesian methods.
The Bayes rule may be written in several equivalent forms:

$$h^*(x) = \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > \mathbb{P}(Y = 0 \mid X = x) \\ 0 & \text{otherwise} \end{cases} \tag{22.5}$$

and

$$h^*(x) = \begin{cases} 1 & \text{if } \pi f_1(x) > (1 - \pi) f_0(x) \\ 0 & \text{otherwise.} \end{cases} \tag{22.6}$$

$^1$One can use other loss functions. For simplicity we will use the error rate as our loss function.

22.5 Theorem. The Bayes rule is optimal, that is, if $h$ is any other classification
rule then $L(h^*) \le L(h)$.

The  Bayes  rule  depends  on  unknown  quantities  so  we  need  to  use  the  data 
to  find  some  approximation  to  the  Bayes  rule.  At  the  risk  of  oversimplifying, 
there  are  three  main  approaches: 

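1. Empirical Risk Minimization. Choose a set of classifiers $\mathcal{H}$ and find $\hat h \in \mathcal{H}$
that minimizes some estimate of $L(h)$.

2. Regression. Find an estimate $\hat r$ of the regression function $r$ and define

$$\hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

3. Density Estimation. Estimate $f_0$ from the $X_i$'s for which $Y_i = 0$, estimate
$f_1$ from the $X_i$'s for which $Y_i = 1$ and let $\hat\pi = n^{-1}\sum_{i=1}^{n} Y_i$. Define

$$\hat r(x) = \hat{\mathbb{P}}(Y = 1 \mid X = x) = \frac{\hat\pi \hat f_1(x)}{\hat\pi \hat f_1(x) + (1 - \hat\pi)\hat f_0(x)}$$

and

$$\hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$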

Now  let  us  generalize  to  the  case  where  Y  takes  on  more  than  two  values 
as  follows. 


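22.6 Theorem. Suppose that $Y \in \mathcal{Y} = \{1, \ldots, K\}$. The optimal rule is

$$h(x) = \mathrm{argmax}_k\ \mathbb{P}(Y = k \mid X = x) \tag{22.7}$$
$$\phantom{h(x)} = \mathrm{argmax}_k\ \pi_k f_k(x) \tag{22.8}$$

where

$$\mathbb{P}(Y = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_r f_r(x)\,\pi_r}, \tag{22.9}$$

$\pi_r = \mathbb{P}(Y = r)$, $f_r(x) = f(x \mid Y = r)$ and $\mathrm{argmax}_k$ means "the value of $k$
that maximizes that expression."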


22.3 Gaussian and Linear Classifiers


Perhaps the simplest approach to classification is to use the density estimation
strategy and assume a parametric model for the densities. Suppose that
$\mathcal{Y} = \{0, 1\}$ and that $f_0(x) = f(x \mid Y = 0)$ and $f_1(x) = f(x \mid Y = 1)$ are both
multivariate Gaussians:

$$f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right\}, \qquad k = 0, 1.$$

Thus, $X \mid Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X \mid Y = 1 \sim N(\mu_1, \Sigma_1)$.


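22.7 Theorem. If $X \mid Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X \mid Y = 1 \sim N(\mu_1, \Sigma_1)$, then the
Bayes rule is

$$h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2\log\left(\dfrac{\pi_1}{\pi_0}\right) + \log\left(\dfrac{|\Sigma_0|}{|\Sigma_1|}\right) \\ 0 & \text{otherwise} \end{cases} \tag{22.10}$$

where

$$r_i^2 = (x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i), \qquad i = 0, 1 \tag{22.11}$$

is the Mahalanobis distance. An equivalent way of expressing the Bayes'
rule is

$$h^*(x) = \mathrm{argmax}_k\ \delta_k(x)$$

where

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k \tag{22.12}$$

and $|A|$ denotes the determinant of a matrix $A$.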


The decision boundary of the above classifier is quadratic so this procedure
is called quadratic discriminant analysis (QDA). In practice, we use
sample estimates of $\pi, \mu_0, \mu_1, \Sigma_0, \Sigma_1$ in place of the true values, namely:

$$\hat\pi_0 = \frac{1}{n}\sum_{i=1}^{n}(1 - Y_i), \qquad \hat\pi_1 = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

$$\hat\mu_0 = \frac{1}{n_0}\sum_{i:\,Y_i = 0} X_i, \qquad \hat\mu_1 = \frac{1}{n_1}\sum_{i:\,Y_i = 1} X_i$$

$$S_0 = \frac{1}{n_0}\sum_{i:\,Y_i = 0}(X_i - \hat\mu_0)(X_i - \hat\mu_0)^T, \qquad S_1 = \frac{1}{n_1}\sum_{i:\,Y_i = 1}(X_i - \hat\mu_1)(X_i - \hat\mu_1)^T$$

where $n_0 = \sum_i (1 - Y_i)$ and $n_1 = \sum_i Y_i$.

A simplification occurs if we assume that $\Sigma_0 = \Sigma_1 = \Sigma$. In that case, the
Bayes rule is

$$h^*(x) = \mathrm{argmax}_k\ \delta_k(x) \tag{22.13}$$

where now

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k. \tag{22.14}$$

The parameters are estimated as before, except that the MLE of $\Sigma$ is

$$S = \frac{n_0 S_0 + n_1 S_1}{n_0 + n_1}.$$

The classification rule is

$$h^*(x) = \begin{cases} 1 & \text{if } \delta_1(x) > \delta_0(x) \\ 0 & \text{otherwise} \end{cases} \tag{22.15}$$

where

$$\delta_j(x) = x^T S^{-1}\hat\mu_j - \frac{1}{2}\hat\mu_j^T S^{-1}\hat\mu_j + \log\hat\pi_j$$

is called the discriminant function. The decision boundary $\{x : \delta_0(x) = \delta_1(x)\}$
is linear so this method is called linear discrimination analysis
(LDA).
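The LDA rule (22.15) is short enough to write out by hand for two covariates. The sketch below is illustrative only: it is restricted to $d = 2$ so that the matrix inverse has a closed form, the simulated Gaussian clouds are invented for the example, and the helper names are not from the text.

```python
import math
import random

def mean2(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter2(rows, mu):
    """Sum of (x - mu)(x - mu)^T over rows, as a 2x2 nested list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def fit_lda(X, Y):
    groups = {k: [x for x, y in zip(X, Y) if y == k] for k in (0, 1)}
    n = len(Y)
    mu = {k: mean2(g) for k, g in groups.items()}
    pi = {k: len(g) / n for k, g in groups.items()}
    # Pooled MLE S = (n0 S0 + n1 S1) / (n0 + n1); n_k S_k is the group scatter
    s0, s1 = scatter2(groups[0], mu[0]), scatter2(groups[1], mu[1])
    S = [[(s0[a][b] + s1[a][b]) / n for b in range(2)] for a in range(2)]
    Sinv = inv2(S)

    def delta(k, x):
        v = [Sinv[0][0] * mu[k][0] + Sinv[0][1] * mu[k][1],
             Sinv[1][0] * mu[k][0] + Sinv[1][1] * mu[k][1]]   # S^{-1} mu_k
        return (x[0] * v[0] + x[1] * v[1]
                - 0.5 * (mu[k][0] * v[0] + mu[k][1] * v[1])
                + math.log(pi[k]))

    return lambda x: 1 if delta(1, x) > delta(0, x) else 0

random.seed(0)
X = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
     + [(random.gauss(3, 1), random.gauss(3, 1)) for _ in range(100)])
Y = [0] * 100 + [1] * 100
h = fit_lda(X, Y)
err = sum(1 for x, y in zip(X, Y) if h(x) != y) / len(Y)
print(err)
```

For QDA one would keep separate matrices $S_0$ and $S_1$ and add the $-\frac{1}{2}\log|S_k|$ term from (22.12) to each discriminant.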


22.8 Example. Let us return to the South African heart disease data. The
decision rule in Example 22.2 was obtained by linear discrimination. The
outcome was

            classified as 0    classified as 1
  y = 0          277                 25
  y = 1          116                 44

The observed misclassification rate is 141/462 = .31. Including all the covariates
reduces the error rate to .27. The results from quadratic discrimination
are

            classified as 0    classified as 1
  y = 0          272                 30
  y = 1          113                 47

which has about the same error rate 143/462 = .31. Including all the covariates
reduces the error rate to .26. In this example, there is little advantage to QDA
over LDA. ■


Now we generalize to the case where $Y$ takes on more than two values.

22.9 Theorem. Suppose that $Y \in \{1, \ldots, K\}$. If $f_k(x) = f(x \mid Y = k)$ is
Gaussian, the Bayes rule is

$$h(x) = \mathrm{argmax}_k\ \delta_k(x)$$

where

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k. \tag{22.16}$$

If the variances of the Gaussians are equal, then

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k. \tag{22.17}$$


We estimate $\delta_k(x)$ by inserting estimates of $\mu_k$, $\Sigma_k$ and $\pi_k$. There is
another version of linear discriminant analysis due to Fisher. The idea is
to first reduce the dimension of covariates to one dimension by projecting
the data onto a line. Algebraically, this means replacing the covariate $X =
(X_1, \ldots, X_d)$ with a linear combination $U = w^T X = \sum_{j=1}^{d} w_j X_j$. The goal is
to choose the vector $w = (w_1, \ldots, w_d)$ that "best separates the data." Then
we perform classification with the one-dimensional covariate $U$ instead of $X$.

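We need to define what we mean by separation of the groups. We would like
the two groups to have means that are far apart relative to their spread. Let
$\mu_j$ denote the mean of $X$ for $Y = j$ and let $\Sigma$ be the variance matrix of $X$. Then
$\mathbb{E}(U \mid Y = j) = \mathbb{E}(w^T X \mid Y = j) = w^T\mu_j$ and $\mathbb{V}(U) = w^T\Sigma w$. Define the
separation$^2$ by

$$J(w) = \frac{\bigl(\mathbb{E}(U \mid Y = 0) - \mathbb{E}(U \mid Y = 1)\bigr)^2}{w^T\Sigma w} = \frac{(w^T\mu_0 - w^T\mu_1)^2}{w^T\Sigma w} = \frac{w^T(\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w}{w^T\Sigma w}.$$

We estimate $J$ as follows. Let $n_j = \sum_{i=1}^{n} I(Y_i = j)$ be the number of observations
in group $j$, let $\overline{X}_j$ be the sample mean vector of the $X$'s for group $j$,
and let $S_j$ be the sample covariance matrix in group $j$. Define

$$\hat J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{22.18}$$

where

$$S_B = (\overline{X}_0 - \overline{X}_1)(\overline{X}_0 - \overline{X}_1)^T, \qquad S_W = \frac{(n_0 - 1)S_0 + (n_1 - 1)S_1}{(n_0 - 1) + (n_1 - 1)}.$$

$^2$The quantity $J$ arises in physics, where it is called the Rayleigh coefficient.

22.10 Theorem. The vector

$$w = S_W^{-1}(\overline{X}_0 - \overline{X}_1) \tag{22.19}$$

is a maximizer of $\hat J(w)$. We call

$$U = w^T X = (\overline{X}_0 - \overline{X}_1)^T S_W^{-1} X \tag{22.20}$$

the Fisher linear discriminant function. The midpoint between $\overline{X}_0$ and
$\overline{X}_1$ is $\frac{1}{2}(\overline{X}_0 + \overline{X}_1)$, and its projection onto $w$ is

$$m = \frac{1}{2}w^T(\overline{X}_0 + \overline{X}_1) = \frac{1}{2}(\overline{X}_0 - \overline{X}_1)^T S_W^{-1}(\overline{X}_0 + \overline{X}_1). \tag{22.21}$$

Fisher's classification rule is

$$h(x) = \begin{cases} 0 & \text{if } w^T X \ge m \\ 1 & \text{if } w^T X < m. \end{cases}$$

Fisher's rule is the same as the Bayes linear classifier in equation (22.14)
when $\hat\pi = 1/2$.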

22.4  Linear  Regression  and  Logistic  Regression 

A more direct approach to classification is to estimate the regression function
$r(x) = \mathbb{E}(Y \mid X = x)$ without bothering to estimate the densities $f_k$. For the
rest of this section, we will only consider the case where $\mathcal{Y} = \{0, 1\}$. Thus,
$r(x) = \mathbb{P}(Y = 1 \mid X = x)$ and once we have an estimate $\hat r$, we will use the
classification rule

$$\hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \tag{22.22}$$

The simplest regression model is the linear regression model

$$Y = r(x) + \epsilon = \beta_0 + \sum_{j=1}^{d}\beta_j X_j + \epsilon \tag{22.23}$$

where $\mathbb{E}(\epsilon) = 0$. This model can't be correct since it does not force $Y = 0$ or
1. Nonetheless, it can sometimes lead to a decent classifier.


Recall that the least squares estimate of $\beta = (\beta_0, \beta_1, \ldots, \beta_d)^T$ minimizes
the residual sums of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n}\left(Y_i - \beta_0 - \sum_{j=1}^{d} X_{ij}\beta_j\right)^2.$$

Let $\mathbf{X}$ denote the $n \times (d + 1)$ matrix of the form

$$\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1d} \\ 1 & X_{21} & \cdots & X_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nd} \end{pmatrix}.$$

Also let $Y = (Y_1, \ldots, Y_n)^T$. Then,

$$\mathrm{RSS}(\beta) = (Y - \mathbf{X}\beta)^T(Y - \mathbf{X}\beta)$$

and the model can be written as

$$Y = \mathbf{X}\beta + \epsilon$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$. From Theorem 13.13,

$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Y.$$

The predicted values are

$$\hat Y = \mathbf{X}\hat\beta.$$

Now we use (22.22) to classify, where $\hat r(x) = \hat\beta_0 + \sum_j \hat\beta_j x_j$.

An alternative is to use logistic regression which was also discussed in Chapter
13. The model is

$$r(x) = \mathbb{P}(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}} \tag{22.24}$$

and the MLE $\hat\beta$ is obtained numerically.
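One simple way to obtain the MLE numerically is gradient ascent on the log-likelihood, whose gradient is $\sum_i (Y_i - r(X_i))(1, X_i)$. The sketch below uses that approach as an assumption made here; the text does not say which numerical method is used, and standard software typically relies on Newton-type iterations instead. The simulated data, step size, and function names are illustrative.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function e^t / (1 + e^t)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(X, Y, steps=1000, lr=0.1):
    d, n = len(X[0]), len(Y)
    beta = [0.0] * (d + 1)                      # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for x, y in zip(X, Y):
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
            grad[0] += y - p
            for j in range(d):
                grad[j + 1] += (y - p) * x[j]
        # ascent step on the average log-likelihood
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def classify(beta, x):
    """Rule (22.22) with r_hat given by the fitted logistic model."""
    r_hat = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
    return 1 if r_hat > 0.5 else 0

# Simulated data from a logistic model with beta_0 = -1 and beta_1 = 2
random.seed(1)
X = [(random.gauss(0.0, 1.0),) for _ in range(200)]
Y = [1 if random.random() < sigmoid(-1.0 + 2.0 * x[0]) else 0 for x in X]
beta_hat = fit_logistic(X, Y)
err = sum(1 for x, y in zip(X, Y) if classify(beta_hat, x) != y) / len(Y)
print([round(b, 2) for b in beta_hat], err)
```

Since $\hat r(x) > 1/2$ exactly when the linear predictor is positive, the resulting classifier is linear in $x$.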


22.11  Example.  Let  us  return  to  the  heart  disease  data.  The  mle  is  given  in 
Example  13.17.  The  error  rate,  using  this  model  for  classification,  is  .27.  The 
error  rate  from  a  linear  regression  is  .26. 


We can get a better classifier by fitting a richer model. For example, we
could fit

$$\mathrm{logit}\ \mathbb{P}(Y = 1 \mid X = x) = \beta_0 + \sum_j \beta_j x_j + \sum_{j,k}\beta_{jk}\,x_j x_k. \tag{22.25}$$

More generally, we could add terms of up to order $r$ for some integer $r$. Large
values of $r$ give a more complicated model which should fit the data better.
But there is a bias-variance tradeoff which we'll discuss later.

22.12  Example.  If  we  use  model  (22.25)  for  the  heart  disease  data  with  r  =  2, 
the  error  rate  is  reduced  to  .22.  ■ 


22.5  Relationship  Between  Logistic  Regression  and 
LDA 

LDA and logistic regression are almost the same thing. If we assume that each
group is Gaussian with the same covariance matrix, then we saw earlier that

$$\log\left(\frac{\mathbb{P}(Y = 1 \mid X = x)}{\mathbb{P}(Y = 0 \mid X = x)}\right) = \log\left(\frac{\pi_1}{\pi_0}\right) - \frac{1}{2}(\mu_0 + \mu_1)^T\Sigma^{-1}(\mu_1 - \mu_0) + x^T\Sigma^{-1}(\mu_1 - \mu_0) \equiv \alpha_0 + \alpha^T x.$$

On the other hand, the logistic model is, by assumption,

$$\log\left(\frac{\mathbb{P}(Y = 1 \mid X = x)}{\mathbb{P}(Y = 0 \mid X = x)}\right) = \beta_0 + \beta^T x.$$

These are the same model since they both lead to classification rules that are
linear in $x$. The difference is in how we estimate the parameters.

The joint density of a single observation is $f(x, y) = f(x \mid y)f(y) = f(y \mid x)f(x)$.
In LDA we estimated the whole joint distribution by maximizing the likelihood

$$\prod_i f(x_i, y_i) = \underbrace{\prod_i f(x_i \mid y_i)}_{\text{Gaussian}}\ \underbrace{\prod_i f(y_i)}_{\text{Bernoulli}}. \tag{22.26}$$

In logistic regression we maximized the conditional likelihood $\prod_i f(y_i \mid x_i)$ but
we ignored the second term $f(x_i)$:

$$\prod_i f(x_i, y_i) = \underbrace{\prod_i f(y_i \mid x_i)}_{\text{logistic}}\ \underbrace{\prod_i f(x_i)}_{\text{ignored}}. \tag{22.27}$$

Since classification only requires knowing $f(y \mid x)$, we don't really need to estimate
the whole joint distribution. Logistic regression leaves the marginal
distribution $f(x)$ unspecified so it is more nonparametric than LDA. This is
an advantage of the logistic regression approach over LDA.

To summarize: LDA and logistic regression both lead to a linear classification
rule. In LDA we estimate the entire joint distribution $f(x, y) =
f(x \mid y)f(y)$. In logistic regression we only estimate $f(y \mid x)$ and we don't bother
estimating $f(x)$.


22.6  Density  Estimation  and  Naive  Bayes 

The Bayes rule is $h(x) = \mathrm{argmax}_k\, \pi_k f_k(x)$. If we can estimate
$\pi_k$ and $f_k$ then we can estimate the Bayes classification rule.
Estimating $\pi_k$ is easy but what about $f_k$? We did this previously by
assuming $f_k$ was Gaussian. Another strategy is to estimate $f_k$ with some
nonparametric density estimator $\hat f_k$ such as a kernel estimator. But if
$x = (x_1,\ldots,x_d)$ is high-dimensional, nonparametric density estimation is
not very reliable. This problem is ameliorated if we assume that
$X_1,\ldots,X_d$ are independent within each group, for then
$f_k(x_1,\ldots,x_d) = \prod_{j=1}^d f_{kj}(x_j)$. This reduces the problem to
$d$ one-dimensional density estimation problems, within each of the $k$ groups.
The resulting classifier is called the naive Bayes classifier. The assumption
that the components of $X$ are independent is usually wrong yet the resulting
classifier might still be accurate. Here is a summary of the steps in the naive
Bayes classifier:


The Naive Bayes Classifier

1. For each group $k$, compute an estimate $\hat f_{kj}$ of the density
$f_{kj}$ for $X_j$, using the data for which $Y_i = k$.

2. Let

$$
\hat f_k(x) = \hat f_k(x_1,\ldots,x_d) = \prod_{j=1}^d \hat f_{kj}(x_j).
$$

3. Let

$$
\hat\pi_k = \frac{1}{n}\sum_{i=1}^n I(Y_i = k)
$$

where $I(Y_i = k) = 1$ if $Y_i = k$ and $I(Y_i = k) = 0$ if $Y_i \neq k$.

4. Let

$$
h(x) = \mathrm{argmax}_k\, \hat\pi_k \hat f_k(x).
$$




FIGURE  22.2.  A  simple  classification  tree. 

The naive Bayes classifier is popular when $x$ is high-dimensional and
discrete. In that case, $\hat f_{kj}(x_j)$ is especially simple.
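The four steps of the naive Bayes classifier can be sketched as follows for continuous covariates. This is only an illustration under stated assumptions: each one-dimensional density is estimated with a Gaussian kernel estimator using a fixed, hand-picked bandwidth, and the function names and simulated data are hypothetical.

```python
import numpy as np

def kde_1d(train, query, bandwidth):
    """Gaussian kernel density estimate of a 1-d sample, evaluated at query."""
    u = (query[:, None] - train[None, :]) / bandwidth
    norm = len(train) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm

def naive_bayes_predict(x_train, y_train, x_new, bandwidth=0.5):
    classes = np.unique(y_train)
    log_score = np.zeros((len(x_new), len(classes)))
    for c, k in enumerate(classes):
        xk = x_train[y_train == k]
        log_score[:, c] = np.log(len(xk) / len(x_train))   # step 3: pi_hat_k
        for j in range(x_train.shape[1]):                   # steps 1 and 2
            dens = kde_1d(xk[:, j], x_new[:, j], bandwidth)
            log_score[:, c] += np.log(dens + 1e-300)        # avoid log(0)
    return classes[np.argmax(log_score, axis=1)]            # step 4

rng = np.random.default_rng(0)

def simulate(n):
    # Hypothetical data: two groups whose means differ by 2.5 per coordinate.
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 2)) + 2.5 * y[:, None]
    return x, y

x_train, y_train = simulate(400)
x_test, y_test = simulate(200)
nb_accuracy = (naive_bayes_predict(x_train, y_train, x_test) == y_test).mean()
print(nb_accuracy)
```

Working with log scores rather than the raw product keeps the computation stable when $d$ is large. In practice the bandwidth would be chosen by cross-validation rather than fixed.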


22.7  Trees 


Trees  are  classification  methods  that  partition  the  covariate  space  X  into 
disjoint  pieces  and  then  classify  the  observations  according  to  which  partition 
element  they  fall  in.  As  the  name  implies,  the  classifier  can  be  represented  as 
a  tree. 

For illustration, suppose there are two covariates, $X_1$ = age and $X_2$ =
blood pressure. Figure 22.2 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age $\ge 50$ then we
classify him as $Y = 1$. If a subject has Age $< 50$ then we check his blood
pressure. If systolic blood pressure is $< 100$ then we classify him as
$Y = 1$, otherwise we classify him as $Y = 0$. Figure 22.3 shows the same
classifier as a partition of the covariate space.

FIGURE 22.3. Partition representation of classification tree.

Here is how a tree is constructed. First, suppose that
$y \in \mathcal{Y} = \{0,1\}$ and that there is only a single covariate $X$. We
choose a split point $t$ that divides the real line into two sets
$A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$. Let $\hat p_s(j)$ be the
proportion of observations in $A_s$ such that $Y_i = j$:

$$
\hat p_s(j) = \frac{\sum_{i=1}^n I(Y_i = j,\, X_i \in A_s)}
                   {\sum_{i=1}^n I(X_i \in A_s)} \tag{22.28}
$$

for $s = 1,2$ and $j = 0,1$. The impurity of the split $t$ is defined to be

$$
I(t) = \sum_{s=1}^2 \gamma_s \tag{22.29}
$$

where

$$
\gamma_s = 1 - \sum_{j=0}^1 \hat p_s(j)^2. \tag{22.30}
$$

This particular measure of impurity is known as the Gini index. If a partition
element $A_s$ contains all 0’s or all 1’s, then $\gamma_s = 0$. Otherwise,
$\gamma_s > 0$. We choose the split point $t$ to minimize the impurity. (Other
indices of impurity can be used besides the Gini index.)

When there are several covariates, we choose whichever covariate and split
that leads to the lowest impurity. This process is continued until some stopping
criterion is met. For example, we might stop when every partition element has
fewer than $n_0$ data points, where $n_0$ is some fixed number. The bottom nodes
of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on
whether there are more data points with $Y = 0$ or $Y = 1$ in that partition
element.
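The search for a split point on a single covariate can be sketched as below. This is a minimal illustration assuming binary labels coded 0/1 and using the unweighted impurity $\gamma_1 + \gamma_2$ of (22.29); the function names and the toy ages are hypothetical, and candidate splits are taken to be midpoints between adjacent distinct covariate values.

```python
import numpy as np

def gini(labels):
    """gamma_s = 1 - sum_j p_s(j)^2 for the 0/1 labels in one partition element."""
    if len(labels) == 0:
        return 0.0
    p1 = np.mean(labels)
    return 1.0 - p1**2 - (1.0 - p1)**2

def best_split(x, y):
    """Return (t, I(t)) minimizing the impurity gamma_1 + gamma_2."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0        # midpoints between distinct values
    impurities = [gini(y[x <= t]) + gini(y[x > t]) for t in candidates]
    best = int(np.argmin(impurities))
    return candidates[best], impurities[best]

# Hypothetical ages and outcomes, perfectly separated at age 50.
x = np.array([23, 35, 41, 48, 52, 57, 63, 70], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
t, impurity = best_split(x, y)
print(t, impurity)
```

With several covariates, the same search would be repeated for each covariate and the covariate and split with the lowest impurity retained; the procedure is then applied recursively inside each partition element.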

This procedure is easily generalized to the case where
$Y \in \{1,\ldots,K\}$. We simply define the impurity by

$$
\gamma_s = 1 - \sum_{j=1}^K \hat p_s(j)^2 \tag{22.31}
$$

where $\hat p_s(j)$ is the proportion of observations in the partition element
for which $Y = j$.

22.13 Example. A classification tree for the heart disease data yields a
misclassification rate of .21. If we build a tree using only tobacco and age,
the misclassification rate is then .29. The tree is shown in Figure 22.4. ■

FIGURE 22.4. A classification tree for the heart disease data using two covariates.

Our  description  of  how  to  build  trees  is  incomplete.  If  we  keep  splitting 
until  there  are  few  cases  in  each  leaf  of  the  tree,  we  are  likely  to  overfit  the 
data.  We  should  choose  the  complexity  of  the  tree  in  such  a  way  that  the 
estimated  true  error  rate  is  low.  In  the  next  section,  we  discuss  estimation  of 
the  error  rate. 


22.8  Assessing  Error  Rates  and  Choosing  a  Good 
Classifier 

How do we choose a good classifier? We would like to have a classifier $h$ with
a low true error rate $L(h)$. Usually, we can’t use the training error rate
$\hat L_n(h)$ as an estimate of the true error rate because it is biased
downward.

22.14 Example. Consider the heart disease data again. Suppose we fit a
sequence of logistic regression models. In the first model we include one
covariate. In the second model we include two covariates, and so on. The ninth
model includes all the covariates. We can go even further. Let’s also fit a
tenth model that includes all nine covariates plus the first covariate squared.
Then we fit an eleventh model that includes all nine covariates plus the first
covariate squared and the second covariate squared. Continuing this way we will
get a sequence of 18 classifiers of increasing complexity. The solid line in
Figure 22.5 shows the observed classification error which steadily decreases as
we make the model more complex. If we keep going, we can make a model with zero
observed classification error. The dotted line shows the 10-fold
cross-validation estimate of the error rate (to be explained shortly) which is
a better estimate of the true error rate than the observed classification
error. The estimated error decreases for a while then increases. This is
essentially the bias-variance tradeoff phenomenon we have seen in Chapter 20. ■

FIGURE 22.5. The solid line is the observed error rate and the dashed line is
the cross-validation estimate of the true error rate.


There  are  many  ways  to  estimate  the  error  rate.  We’ll  consider  two:  cross- 
validation  and  probability  inequalities. 

Cross-Validation. The basic idea of cross-validation, which we have already
encountered in curve estimation, is to leave out some of the data when fitting
a model. The simplest version of cross-validation involves randomly splitting
the data into two pieces: the training set $\mathcal{T}$ and the validation set
$\mathcal{V}$. Often, about 10 per cent of the data might be set aside as the
validation set. The classifier $h$ is constructed from the training set. We
then estimate the error by

$$
\hat L(h) = \frac{1}{m}\sum_{X_i \in \mathcal{V}} I(h(X_i) \neq Y_i) \tag{22.32}
$$

where $m$ is the size of the validation set. See Figure 22.6.

FIGURE 22.6. Cross-validation. The data are divided into two groups: the
training data and the validation data. The training data are used to produce an
estimated classifier $\hat h$. Then, $\hat h$ is applied to the validation data
to obtain an estimate $\hat L$ of the error rate of $\hat h$.

Another approach to cross-validation is K-fold cross-validation which is
obtained from the following algorithm.


K-fold cross-validation.

1. Randomly divide the data into $K$ chunks of approximately equal size.
A common choice is $K = 10$.

2. For $k = 1$ to $K$, do the following:

(a) Delete chunk $k$ from the data.

(b) Compute the classifier $\hat h_{(k)}$ from the rest of the data.

(c) Use $\hat h_{(k)}$ to predict the data in chunk $k$. Let $\hat L_{(k)}$
denote the observed error rate.

3. Let

$$
\hat L(h) = \frac{1}{K}\sum_{k=1}^K \hat L_{(k)}. \tag{22.33}
$$
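The K-fold algorithm above can be sketched as a small generic routine. This is an illustrative sketch, not a reference implementation: `fit` and `predict` are hypothetical callables supplied by the user, and the toy classifier (thresholding one covariate at the midpoint of the two group means) and the simulated data are assumptions made only to exercise the code.

```python
import numpy as np

def k_fold_error(x, y, fit, predict, K=10, seed=0):
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(y)), K)      # step 1
    errors = []
    for k in range(K):                                       # step 2
        test_idx = chunks[k]                                 # (a) chunk k is held out
        train_idx = np.concatenate([chunks[j] for j in range(K) if j != k])
        model = fit(x[train_idx], y[train_idx])              # (b)
        wrong = predict(model, x[test_idx]) != y[test_idx]   # (c)
        errors.append(np.mean(wrong))
    return float(np.mean(errors))                            # step 3

# Hypothetical one-covariate classifier used only for the demonstration.
def fit_midpoint(x, y):
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

def predict_midpoint(threshold, x):
    return (x > threshold).astype(int)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
x = rng.normal(loc=2.0 * y, scale=1.0)       # group means 0 and 2, unit variance
cv_error = k_fold_error(x, y, fit_midpoint, predict_midpoint)
print(cv_error)
```

The same routine can be used to compare classifiers of different complexity, for example trees with different numbers of leaves, by calling it once per candidate and keeping the one with the smallest estimated error.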


22.15 Example. We applied 10-fold cross-validation to the heart disease data.
The minimum cross-validation error as a function of the number of leaves
occurred at six. Figure 22.7 shows the tree with six leaves. ■

FIGURE 22.7. Smaller classification tree with size chosen by cross-validation.

Probability Inequalities. Another approach to estimating the error rate
is to find a confidence interval for $L(\hat h)$, centered at
$\hat L_n(\hat h)$, using probability inequalities. This method is useful in
the context of empirical risk minimization.

Let $\mathcal{H}$ be a set of classifiers, for example, all linear classifiers.
Empirical risk minimization means choosing the classifier
$\hat h \in \mathcal{H}$ to minimize the training error $\hat L_n(h)$, also
called the empirical risk. Thus,

$$
\hat h = \mathrm{argmin}_{h\in\mathcal{H}}\, \hat L_n(h)
       = \mathrm{argmin}_{h\in\mathcal{H}}
         \left(\frac{1}{n}\sum_{i} I(h(X_i) \neq Y_i)\right). \tag{22.34}
$$

Typically, $\hat L_n(\hat h)$ underestimates the true error rate $L(\hat h)$
because $\hat h$ was chosen to make $\hat L_n(\hat h)$ small. Our goal is to
assess how much underestimation is taking place. Our main tool for this
analysis is Hoeffding’s inequality (Theorem 4.5). Recall that if
$X_1,\ldots,X_n \sim \text{Bernoulli}(p)$, then, for any $\epsilon > 0$,

$$
P(|\hat p - p| > \epsilon) \le 2e^{-2n\epsilon^2} \tag{22.35}
$$

where $\hat p = n^{-1}\sum_{i=1}^n X_i$.

First, suppose that $\mathcal{H} = \{h_1,\ldots,h_m\}$ consists of finitely
many classifiers. For any fixed $h$, $\hat L_n(h)$ converges almost surely to
$L(h)$ by the law of large numbers. We will now establish a stronger result.

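22.16 Theorem (Uniform Convergence). Assume $\mathcal{H}$ is finite and has
$m$ elements. Then,

$$
P\left(\max_{h\in\mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon\right)
  \le 2me^{-2n\epsilon^2}.
$$

Proof. We will use Hoeffding’s inequality and we will also use the fact
that if $A_1,\ldots,A_m$ is a set of events then
$P(\bigcup_{i=1}^m A_i) \le \sum_{i=1}^m P(A_i)$. Now,

$$
\begin{aligned}
P\left(\max_{h\in\mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon\right)
  &= P\left(\bigcup_{h\in\mathcal{H}} \{|\hat L_n(h) - L(h)| > \epsilon\}\right) \\
  &\le \sum_{h\in\mathcal{H}} P\left(|\hat L_n(h) - L(h)| > \epsilon\right) \\
  &\le \sum_{h\in\mathcal{H}} 2e^{-2n\epsilon^2} = 2me^{-2n\epsilon^2}.
\end{aligned}
$$
■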

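22.17 Theorem. Let

$$
\epsilon = \sqrt{\frac{1}{2n}\log\left(\frac{2m}{\alpha}\right)}.
$$

Then $\hat L_n(\hat h) \pm \epsilon$ is a $1-\alpha$ confidence interval for
$L(\hat h)$.

Proof. This follows from the fact that

$$
P\left(|\hat L_n(\hat h) - L(\hat h)| > \epsilon\right)
  \le P\left(\max_{h\in\mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon\right)
  \le 2me^{-2n\epsilon^2} = \alpha.
$$
■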
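To get a feel for the size of this interval, the half-width can be computed by solving $2me^{-2n\epsilon^2} = \alpha$ for $\epsilon$. The sketch below does exactly that; the function name and the example values of $n$, $m$ and $\alpha$ are hypothetical.

```python
import math

def hoeffding_half_width(n, m, alpha):
    """Solve 2 m exp(-2 n eps^2) = alpha for eps (finite class of m classifiers)."""
    return math.sqrt(math.log(2.0 * m / alpha) / (2.0 * n))

eps = hoeffding_half_width(n=1000, m=10, alpha=0.05)
print(eps)

# The half-width grows only logarithmically with the number of classifiers m.
for m in (10, 1_000, 100_000):
    print(m, hoeffding_half_width(n=1000, m=m, alpha=0.05))
```

The slow (logarithmic) growth in $m$ is what makes the bound usable for fairly large finite classes.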


When $\mathcal{H}$ is large the confidence interval for $L(\hat h)$ is large.
The more functions there are in $\mathcal{H}$ the more likely it is we have
“overfit” which we compensate for by having a larger confidence interval.

In practice we usually use sets $\mathcal{H}$ that are infinite, such as the
set of linear classifiers. To extend our analysis to these cases we want to be
able to say something like

$$
P\left(\sup_{h\in\mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon\right)
  \le \text{something not too big}.
$$

One way to develop such a generalization is by way of the Vapnik-Chervonenkis
or VC dimension.

Let $\mathcal{A}$ be a class of sets. Given a finite set
$F = \{x_1,\ldots,x_n\}$ let

$$
N_{\mathcal{A}}(F) = \#\{F \cap A :\ A \in \mathcal{A}\} \tag{22.36}
$$

be the number of subsets of $F$ “picked out” by $\mathcal{A}$. Here $\#(B)$
denotes the number of elements of a set $B$. The shatter coefficient is defined
by

$$
s(\mathcal{A}, n) = \max_{F \in \mathcal{F}_n} N_{\mathcal{A}}(F) \tag{22.37}
$$

where $\mathcal{F}_n$ consists of all finite sets of size $n$. Now let
$X_1,\ldots,X_n \sim P$ and let

$$
P_n(A) = \frac{1}{n}\sum_{i} I(X_i \in A)
$$

denote the empirical probability measure. The following remarkable theorem
bounds the distance between $P$ and $P_n$.

22.18 Theorem (Vapnik and Chervonenkis (1971)). For any $P$, $n$ and
$\epsilon > 0$,

$$
P\left\{\sup_{A\in\mathcal{A}} |P_n(A) - P(A)| > \epsilon\right\}
  \le 8\,s(\mathcal{A}, n)\,e^{-n\epsilon^2/32}. \tag{22.38}
$$

The proof, though very elegant, is long and we omit it. If $\mathcal{H}$ is a
set of classifiers, define $\mathcal{A}$ to be the class of sets of the form
$\{x :\ h(x) = 1\}$. We then define $s(\mathcal{H}, n) = s(\mathcal{A}, n)$.

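22.19 Theorem.

$$
P\left\{\sup_{h\in\mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon\right\}
  \le 8\,s(\mathcal{H}, n)\,e^{-n\epsilon^2/32}.
$$

A $1-\alpha$ confidence interval for $L(\hat h)$ is
$\hat L_n(\hat h) \pm \epsilon_n$ where

$$
\epsilon_n^2 = \frac{32}{n}\log\left(\frac{8\,s(\mathcal{H}, n)}{\alpha}\right).
$$

These theorems are only useful if the shatter coefficients do not grow too
quickly with $n$. This is where VC dimension enters.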


22.20 Definition. The VC (Vapnik-Chervonenkis) dimension of a class of
sets $\mathcal{A}$ is defined as follows. If $s(\mathcal{A}, n) = 2^n$ for all
$n$, set $VC(\mathcal{A}) = \infty$. Otherwise, define $VC(\mathcal{A})$ to be
the largest $k$ for which $s(\mathcal{A}, k) = 2^k$.

Thus, the VC-dimension is the size of the largest finite set $F$ that can be
shattered by $\mathcal{A}$, meaning that $\mathcal{A}$ picks out each subset of
$F$. If $\mathcal{H}$ is a set of classifiers we define
$VC(\mathcal{H}) = VC(\mathcal{A})$ where $\mathcal{A}$ is the class of sets of
the form $\{x :\ h(x) = 1\}$ as $h$ varies in $\mathcal{H}$. The following
theorem shows that if $\mathcal{A}$ has finite VC-dimension, then the shatter
coefficients grow as a polynomial in $n$.

22.21 Theorem. If $\mathcal{A}$ has finite VC-dimension $v$, then

$$
s(\mathcal{A}, n) \le n^v + 1.
$$


22.22 Example. Let $\mathcal{A} = \{(-\infty, a] :\ a \in \mathbb{R}\}$. Then
$\mathcal{A}$ shatters every 1-point set $\{x\}$ but it shatters no set of the
form $\{x, y\}$. Therefore, $VC(\mathcal{A}) = 1$. ■


22.23 Example. Let $\mathcal{A}$ be the set of closed intervals on the real
line. Then $\mathcal{A}$ shatters $S = \{x, y\}$ but it cannot shatter sets
with 3 points. Consider $S = \{x, y, z\}$ where $x < y < z$. One cannot find an
interval $A$ such that $A \cap S = \{x, z\}$. So, $VC(\mathcal{A}) = 2$. ■
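The claim in this example is small enough to check by brute force. The sketch below enumerates the subsets of a finite point set that closed intervals can pick out (for distinct points on the line these are the empty set and the contiguous runs of the sorted points) and compares the count with $2^n$. The function names and the example points are hypothetical.

```python
def picked_out_by_intervals(points):
    """All subsets of `points` of the form [a, b] ∩ points, including the empty set."""
    pts = sorted(points)
    subsets = {frozenset()}                      # an interval missing every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1])) # a contiguous run of points
    return subsets

def is_shattered(points):
    """True when intervals pick out all 2^n subsets of the n distinct points."""
    return len(picked_out_by_intervals(points)) == 2 ** len(points)

print(is_shattered([0.3, 1.7]))          # two points
print(is_shattered([0.3, 1.7, 2.5]))     # three points: {x, z} is never picked out
```

For three points the count is 7 rather than 8, the missing subset being the two outer points, which matches the argument in the example.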


22.24 Example. Let $\mathcal{A}$ be all linear half-spaces on the plane. Any
3-point set (not all on a line) can be shattered. No 4-point set can be
shattered. Consider, for example, 4 points forming a diamond. Let $T$ be the
left and rightmost points. This can’t be picked out. Other configurations can
also be seen to be unshatterable. So $VC(\mathcal{A}) = 3$. In general,
halfspaces in $\mathbb{R}^d$ have VC dimension $d + 1$. ■


22.25 Example. Let $\mathcal{A}$ be all rectangles on the plane with sides
parallel to the axes. Some 4-point sets can be shattered, for example 4 points
forming a diamond. Let $S$ be a 5-point set. There is one point that is not
leftmost, rightmost, uppermost, or lowermost. Let $T$ be all points in $S$
except this point. Then $T$ can’t be picked out. So $VC(\mathcal{A}) = 4$. ■


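22.26 Theorem. Let $x$ have dimension $d$ and let $\mathcal{H}$ be the set of
linear classifiers. The VC-dimension of $\mathcal{H}$ is $d + 1$. Hence, a
$1-\alpha$ confidence interval for the true error rate is
$\hat L(\hat h) \pm \epsilon$ where

$$
\epsilon^2 = \frac{32}{n}\log\left(\frac{8\,(n^{d+1} + 1)}{\alpha}\right).
$$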
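The half-width in this theorem is easy to evaluate, and doing so shows how conservative the bound is for moderate sample sizes. The following sketch simply codes the formula; the function name and the chosen values of $n$, $d$ and $\alpha$ are hypothetical.

```python
import math

def vc_half_width(n, d, alpha):
    """eps with eps^2 = (32 / n) * log(8 (n^(d+1) + 1) / alpha)."""
    log_term = math.log(8.0) + math.log(float(n) ** (d + 1) + 1.0) - math.log(alpha)
    return math.sqrt(32.0 * log_term / n)

for n in (1_000, 100_000, 10_000_000):
    print(n, vc_half_width(n, d=2, alpha=0.05))
```

With $d = 2$ and $n = 1000$ the half-width is close to 1, so the interval is uninformative; the bound only becomes tight for very large $n$. This is typical of distribution-free bounds.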


22.9  Support  Vector  Machines 

In this section we consider a class of linear classifiers called support vector
machines. Throughout this section, we assume that $Y$ is binary. It will be
convenient to label the outcomes as $-1$ and $+1$ instead of 0 and 1. A linear
classifier can then be written as

$$
h(x) = \mathrm{sign}\big(H(x)\big)
$$

where $x = (x_1,\ldots,x_d)$,

$$
H(x) = a_0 + \sum_{i=1}^d a_i x_i
$$

and

$$
\mathrm{sign}(z) =
\begin{cases}
-1 & \text{if } z < 0 \\
0  & \text{if } z = 0 \\
1  & \text{if } z > 0.
\end{cases}
$$

First, suppose that the data are linearly separable, that is, there exists
a hyperplane that perfectly separates the two classes.

22.27 Lemma. The data can be separated by some hyperplane if and only if
there exists a hyperplane $H(x) = a_0 + \sum_{i=1}^d a_i x_i$ such that

$$
Y_i H(x_i) \ge 1, \quad i = 1,\ldots,n. \tag{22.39}
$$

Proof. Suppose the data can be separated by a hyperplane
$W(x) = b_0 + \sum_{i=1}^d b_i x_i$. It follows that there exists some constant
$c > 0$ such that $Y_i = 1$ implies $W(X_i) \ge c$ and $Y_i = -1$ implies
$W(X_i) \le -c$. Therefore, $Y_i W(X_i) \ge c$ for all $i$. Let
$H(x) = a_0 + \sum_{i=1}^d a_i x_i$ where $a_j = b_j/c$. Then
$Y_i H(X_i) \ge 1$ for all $i$. The reverse direction is straightforward. ■

In the separable case, there will be many separating hyperplanes. How
should we choose one? Intuitively, it seems reasonable to choose the hyperplane
“furthest” from the data in the sense that it separates the $+1$s and $-1$s
and maximizes the distance to the closest point. This hyperplane is called the
maximum margin hyperplane. The margin is the distance from the hyperplane to
the nearest point. Points on the boundary of the margin are called support
vectors. See Figure 22.8.

22.28 Theorem. The hyperplane $\hat H(x) = \hat a_0 + \sum_{i=1}^d \hat a_i x_i$
that separates the data and maximizes the margin is given by minimizing
$(1/2)\sum_{j=1}^d a_j^2$ subject to (22.39).

It turns out that this problem can be recast as a quadratic programming
problem. Let $\langle X_i, X_k\rangle = X_i^T X_k$ denote the inner product of
$X_i$ and $X_k$.

22.29 Theorem. Let $\hat H(x) = \hat a_0 + \sum_{i=1}^d \hat a_i x_i$ denote
the optimal (largest margin) hyperplane. Then, for $j = 1,\ldots,d$,

$$
\hat a_j = \sum_{i=1}^n \hat\alpha_i Y_i X_j(i)
$$

where $X_j(i)$ is the value of the covariate $X_j$ for the $i$th data point, and
$\hat\alpha = (\hat\alpha_1,\ldots,\hat\alpha_n)$ is the vector that maximizes

$$
\sum_{i=1}^n \alpha_i
  - \frac{1}{2}\sum_{i=1}^n\sum_{k=1}^n \alpha_i\alpha_k Y_i Y_k
    \langle X_i, X_k\rangle \tag{22.40}
$$

subject to

$$
\alpha_i \ge 0
$$

and

$$
0 = \sum_{i} \alpha_i Y_i.
$$

The points $X_i$ for which $\hat\alpha_i \neq 0$ are called support vectors.
$\hat a_0$ can be found by solving

$$
\hat\alpha_i\left(Y_i\left(X_i^T \hat a + \hat a_0\right) - 1\right) = 0
$$

for any support point $X_i$. $\hat H$ may be written as

$$
\hat H(x) = \hat a_0 + \sum_{i=1}^n \hat\alpha_i Y_i \langle x, X_i\rangle.
$$

FIGURE 22.8. The hyperplane $\hat H(x)$ has the largest margin of all
hyperplanes that separate the two classes.

There are many software packages that will solve this problem quickly. If
there is no perfect linear classifier, then one allows overlap between the
groups by replacing the condition (22.39) with

$$
Y_i H(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\ldots,n. \tag{22.41}
$$

The variables $\xi_1,\ldots,\xi_n$ are called slack variables.

We now maximize (22.40) subject to

$$
0 \le \alpha_i \le c, \quad i = 1,\ldots,n
$$

and

$$
\sum_{i=1}^n \alpha_i Y_i = 0.
$$

The constant $c$ is a tuning parameter that controls the amount of overlap.
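For readers who want to experiment without a quadratic programming package, a linear soft-margin classifier can be fit approximately by a different route. The sketch below does not solve the dual problem (22.40); it instead runs sub-gradient descent on the equivalent primal objective $\frac{1}{2}\sum_j a_j^2 + c\sum_i \max(0,\, 1 - Y_i H(x_i))$, rescaled by $1/(cn)$. The function name, step-size schedule, and simulated data are all assumptions made for illustration.

```python
import numpy as np

def fit_soft_margin(x, y, c=1.0, steps=2000, lr=0.1):
    """Sub-gradient descent on lam/2 ||a||^2 + mean hinge loss, lam = 1/(c n).

    Labels y must be coded as -1 / +1.
    """
    n, d = x.shape
    lam = 1.0 / (c * n)
    a0, a = 0.0, np.zeros(d)
    for t in range(steps):
        active = y * (a0 + x @ a) < 1.0            # points with positive slack
        ya = y[active][:, None]
        grad_a = lam * a - (ya * x[active]).sum(axis=0) / n
        grad_a0 = -y[active].sum() / n
        step = lr / np.sqrt(t + 1.0)               # decaying step size
        a = a - step * grad_a
        a0 = a0 - step * grad_a0
    return a0, a

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=400)
x = rng.normal(size=(400, 2)) + 2.0 * y[:, None]   # group means (-2,-2) and (2,2)
a0, a = fit_soft_margin(x, y)
svm_accuracy = (np.sign(a0 + x @ a) == y).mean()
print(a0, a, svm_accuracy)
```

A larger $c$ penalizes slack more heavily and so allows less overlap; a dedicated solver for the dual remains the better choice when the support vectors themselves are needed.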


22.10  Kernelization 

There is a trick called kernelization for improving a computationally simple
classifier $h$. The idea is to map the covariate $X$ (which takes values in
$\mathcal{X}$) into a higher dimensional space $\mathcal{Z}$ and apply the
classifier in the bigger space $\mathcal{Z}$. This can yield a more flexible
classifier while retaining computational simplicity.

The standard example of this idea is illustrated in Figure 22.9. The covariate
is $x = (x_1, x_2)$. The $Y_i$’s can be separated into two groups using an
ellipse. Define a mapping $\phi$ by

$$
z = (z_1, z_2, z_3) = \phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2).
$$

Thus, $\phi$ maps $\mathcal{X} = \mathbb{R}^2$ into
$\mathcal{Z} = \mathbb{R}^3$. In the higher-dimensional space $\mathcal{Z}$, the
$Y_i$’s are separable by a linear decision boundary. In other words,

a linear classifier in a higher-dimensional space corresponds to a nonlinear
classifier in the original space.

The point is that to get a richer set of classifiers we do not need to give up
the convenience of linear classifiers. We simply map the covariates to a
higher-dimensional space. This is akin to making linear regression more
flexible by using polynomials.

FIGURE 22.9. Kernelization. Mapping the covariates into a higher-dimensional
space can make a complicated decision boundary into a simpler decision
boundary.

There is a potential drawback. If we significantly expand the dimension
of the problem, we might increase the computational burden. For example,
if $x$ has dimension $d = 256$ and we wanted to use all fourth-order terms,
then $z = \phi(x)$ has dimension 183,181,376. We are spared this computational
nightmare by the following two facts. First, many classifiers do not require
that we know the values of the individual points but, rather, just the inner
product between pairs of points. Second, notice in our example that the inner
product in $\mathcal{Z}$ can be written

$$
\begin{aligned}
\langle z, \tilde z\rangle &= \langle \phi(x), \phi(\tilde x)\rangle \\
  &= x_1^2\tilde x_1^2 + 2x_1\tilde x_1 x_2\tilde x_2 + x_2^2\tilde x_2^2 \\
  &= (\langle x, \tilde x\rangle)^2 \equiv K(x, \tilde x).
\end{aligned}
$$

Thus, we can compute $\langle z, \tilde z\rangle$ without ever computing
$Z_i = \phi(X_i)$.
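The identity is easy to confirm numerically. The sketch below evaluates both sides for one pair of points; the function names and the two example points are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1^2, sqrt(2) x1 x2, x2^2) from R^2 to R^3."""
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def kernel(x, x_tilde):
    """K(x, x~) = <x, x~>^2, computed in the original two-dimensional space."""
    return np.dot(x, x_tilde) ** 2

x = np.array([1.5, -0.5])
x_tilde = np.array([2.0, 3.0])
lhs = np.dot(phi(x), phi(x_tilde))    # inner product in the bigger space
rhs = kernel(x, x_tilde)              # same number, without computing phi
print(lhs, rhs)
```

Both sides equal 2.25 here, since $\langle x, \tilde x\rangle = 1.5$. The kernel evaluation costs one inner product in $\mathbb{R}^2$ regardless of the dimension of the feature space.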

To summarize, kernelization involves finding a mapping
$\phi : \mathcal{X} \to \mathcal{Z}$ and a classifier such that:

1. $\mathcal{Z}$ has higher dimension than $\mathcal{X}$ and so leads to a
richer set of classifiers.

2. The classifier only requires computing inner products.

3. There is a function $K$, called a kernel, such that
$\langle \phi(x), \phi(\tilde x)\rangle = K(x, \tilde x)$.

4. Everywhere the term $\langle x, \tilde x\rangle$ appears in the algorithm,
replace it with $K(x, \tilde x)$.


In fact, we never need to construct the mapping $\phi$ at all. We only need
to specify a kernel $K(x, \tilde x)$ that corresponds to
$\langle \phi(x), \phi(\tilde x)\rangle$ for some $\phi$. This raises an
interesting question: given a function of two variables $K(x, y)$, does there
exist a function $\phi(x)$ such that
$K(x, y) = \langle \phi(x), \phi(y)\rangle$? The answer is provided by Mercer’s
theorem which says, roughly, that if $K$ is positive definite, meaning that

$$
\int\!\!\int K(x, y) f(x) f(y)\,dx\,dy \ge 0
$$

for square integrable functions $f$, then such a $\phi$ exists. Examples of
commonly used kernels are:

polynomial   $K(x, \tilde x) = \left(\langle x, \tilde x\rangle + a\right)^r$
sigmoid      $K(x, \tilde x) = \tanh(a\langle x, \tilde x\rangle + b)$
Gaussian     $K(x, \tilde x) = \exp\left(-\|x - \tilde x\|^2/(2\sigma^2)\right)$
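A finite-sample analogue of the positive definiteness condition is that the matrix of kernel values $K(X_i, X_k)$ on any set of points has no negative eigenvalues. The sketch below checks this numerically for the Gaussian kernel and for a polynomial kernel with $a = 1$, $r = 2$; the function names, parameter values, and random points are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Matrix of exp(-||a_i - b_k||^2 / (2 sigma^2))."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma**2))

def polynomial_kernel(a, b, degree=2, offset=1.0):
    """Matrix of (<a_i, b_k> + offset)^degree."""
    return (a @ b.T + offset) ** degree

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 2))

smallest_eigenvalues = {}
for name, gram in [("gaussian", gaussian_kernel(x, x)),
                   ("polynomial", polynomial_kernel(x, x))]:
    smallest_eigenvalues[name] = np.linalg.eigvalsh(gram).min()
    print(name, smallest_eigenvalues[name])
```

Up to rounding error the smallest eigenvalues should be nonnegative. The same check applied to the sigmoid kernel can fail for some parameter values, which is one reason it is used with more care.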


Let  us  now  see  how  we  can  use  this  trick  in  LDA  and  in  support  vector 
machines. 

Recall that the Fisher linear discriminant method replaces $X$ with
$U = w^T X$ where $w$ is chosen to maximize the Rayleigh coefficient

$$
J(w) = \frac{w^T S_B w}{w^T S_W w},
$$

$$
S_B = (\bar X_0 - \bar X_1)(\bar X_0 - \bar X_1)^T
$$

and

$$
S_W = \frac{(n_0 - 1)S_0}{(n_0 - 1) + (n_1 - 1)}
    + \frac{(n_1 - 1)S_1}{(n_0 - 1) + (n_1 - 1)}.
$$

In the kernelized version, we replace $X_i$ with $Z_i = \phi(X_i)$ and we find
$w$ to maximize

$$
J(w) = \frac{w^T \tilde S_B w}{w^T \tilde S_W w} \tag{22.42}
$$

where

$$
\tilde S_B = (\bar Z_0 - \bar Z_1)(\bar Z_0 - \bar Z_1)^T
$$

and

$$
\tilde S_W = \frac{(n_0 - 1)\tilde S_0}{(n_0 - 1) + (n_1 - 1)}
           + \frac{(n_1 - 1)\tilde S_1}{(n_0 - 1) + (n_1 - 1)}.
$$

Here, $\tilde S_j$ is the sample covariance of the $Z_i$’s for which $Y = j$.
However, to take advantage of kernelization, we need to re-express this in
terms of inner products and then replace the inner products with kernels.

It can be shown that the maximizing vector $w$ is a linear combination of
the $Z_i$’s. Hence we can write

$$
w = \sum_{i=1}^n \alpha_i Z_i.
$$

Also,

$$
\bar Z_j = \frac{1}{n_j}\sum_{i=1}^n \phi(X_i)\, I(Y_i = j).
$$

Therefore,

$$
\begin{aligned}
w^T \bar Z_j
  &= \left(\sum_{i=1}^n \alpha_i Z_i\right)^T
     \left(\frac{1}{n_j}\sum_{s=1}^n \phi(X_s)\, I(Y_s = j)\right) \\
  &= \frac{1}{n_j}\sum_{i=1}^n\sum_{s=1}^n \alpha_i\, I(Y_s = j)\, Z_i^T \phi(X_s) \\
  &= \frac{1}{n_j}\sum_{i=1}^n \alpha_i \sum_{s=1}^n I(Y_s = j)\,
     \phi(X_i)^T \phi(X_s) \\
  &= \frac{1}{n_j}\sum_{i=1}^n \alpha_i \sum_{s=1}^n I(Y_s = j)\, K(X_i, X_s) \\
  &= \alpha^T M_j
\end{aligned}
$$

where $M_j$ is a vector whose $i$th component is

$$
M_j(i) = \frac{1}{n_j}\sum_{s=1}^n K(X_i, X_s)\, I(Y_s = j).
$$

It follows that

$$
w^T \tilde S_B w = \alpha^T M \alpha
$$

where $M = (M_0 - M_1)(M_0 - M_1)^T$. By similar calculations, we can write

$$
w^T \tilde S_W w = \alpha^T N \alpha
$$

where

$$
N = K_0\left(I - \frac{1}{n_0}\mathbf{1}\right)K_0^T
  + K_1\left(I - \frac{1}{n_1}\mathbf{1}\right)K_1^T,
$$

$I$ is the identity matrix, $\mathbf{1}$ is a matrix of all ones, and $K_j$ is
the $n \times n_j$ matrix with entries $(K_j)_{rs} = K(x_r, x_s)$ with $x_s$
varying over the observations in group $j$. Hence, we now find $\alpha$ to
maximize

$$
J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}.
$$

All the quantities are expressed in terms of the kernel. Formally, the solution
is $\alpha = N^{-1}(M_0 - M_1)$. However, $N$ might be non-invertible. In this
case one replaces $N$ by $N + bI$, for some constant $b$. Finally, the
projection onto the new subspace can be written as

$$
U = w^T \phi(x) = \sum_{i=1}^n \alpha_i K(x_i, x).
$$
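The kernelized Fisher discriminant can be sketched directly from these formulas. The code below is an illustration under stated assumptions: it uses a Gaussian kernel with $\sigma = 1$, the regularized solution $\alpha = (N + bI)^{-1}(M_0 - M_1)$ with a hand-picked $b$, and simulated data (an inner disc and an outer ring) that no linear rule in the original space can separate. All function and variable names are hypothetical.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_fisher(x, y, b=1e-3, sigma=1.0):
    """Return alpha = (N + b I)^{-1} (M_0 - M_1) for labels y in {0, 1}."""
    n = len(y)
    gram = gaussian_gram(x, x, sigma)
    M, N = [], np.zeros((n, n))
    for j in (0, 1):
        Kj = gram[:, y == j]                       # n x n_j block K_j
        nj = Kj.shape[1]
        M.append(Kj.mean(axis=1))                  # M_j(i) = (1/n_j) sum_s K(X_i, X_s)
        center = np.eye(nj) - np.ones((nj, nj)) / nj
        N += Kj @ center @ Kj.T                    # K_j (I - 1/n_j) K_j^T
    return np.linalg.solve(N + b * np.eye(n), M[0] - M[1])

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
radius = np.where(y == 1, rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n))
angle = rng.uniform(0.0, 2.0 * np.pi, n)
x = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

alpha = kernel_fisher(x, y)
u = gaussian_gram(x, x) @ alpha                    # U_i = sum_s alpha_s K(X_s, X_i)
threshold = (u[y == 0].mean() + u[y == 1].mean()) / 2.0
pred = (u < threshold).astype(int)                 # group 0 projects to larger U
kfd_accuracy = (pred == y).mean()
print(kfd_accuracy)
```

Thresholding the projection at the midpoint of the two projected group means is one simple way to turn $U$ into a classifier; the accuracy reported here is a training accuracy, so it is optimistic.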

The support vector machine can similarly be kernelized. We simply replace
$\langle X_i, X_j\rangle$ with $K(X_i, X_j)$. For example, instead of
maximizing (22.40), we now maximize

$$
\sum_{i=1}^n \alpha_i
  - \frac{1}{2}\sum_{i=1}^n\sum_{k=1}^n \alpha_i\alpha_k Y_i Y_k K(X_i, X_k).
  \tag{22.43}
$$

The hyperplane can be written as
$\hat H(x) = \hat a_0 + \sum_{i=1}^n \hat\alpha_i Y_i K(X, X_i)$.


22.11  Other  Classifiers 


There  are  many  other  classifiers  and  space  precludes  a  full  discussion  of  all  of 
them.  Let  us  briefly  mention  a  few. 

The  k-nearest-neighbors  classifier  is  very  simple.  Given  a  point  x,  find 
the  k  data  points  closest  to  x.  Classify  x  using  the  majority  vote  of  these  k 
neighbors.  Ties  can  be  broken  randomly.  The  parameter  k  can  be  chosen  by 
cross-validation. 
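A minimal sketch of the k-nearest-neighbors rule follows, assuming binary labels coded 0/1 and Euclidean distance. The function name, the simulated data, and the choice $k = 5$ are hypothetical.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k, rng):
    """Majority vote among the k nearest training points (labels coded 0/1)."""
    dist = ((x_new[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)          # fraction of neighbors with Y = 1
    pred = (votes > 0.5).astype(int)
    ties = votes == 0.5
    pred[ties] = rng.integers(0, 2, size=ties.sum())   # break ties at random
    return pred

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=300)
x_train = rng.normal(size=(300, 2)) + 3.0 * y_train[:, None]
y_test = rng.integers(0, 2, size=100)
x_test = rng.normal(size=(100, 2)) + 3.0 * y_test[:, None]

pred = knn_predict(x_train, y_train, x_test, k=5, rng=rng)
knn_accuracy = (pred == y_test).mean()
print(knn_accuracy)
```

Ties can only occur when $k$ is even. To choose $k$, the function can be called inside a cross-validation loop over a grid of candidate values.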

Bagging is a method for reducing the variability of a classifier. It is most
helpful for highly nonlinear classifiers such as trees. We draw $B$ bootstrap
samples from the data. The $b$th bootstrap sample yields a classifier $h_b$.
The final classifier is

$$
\hat h(x) =
\begin{cases}
1 & \text{if } \frac{1}{B}\sum_{b=1}^B h_b(x) \ge \frac{1}{2} \\
0 & \text{otherwise.}
\end{cases}
$$


Boosting is a method for starting with a simple classifier and gradually
improving it by refitting the data giving higher weight to misclassified
samples. Suppose that $\mathcal{H}$ is a collection of classifiers, for
example, trees with only one split. Assume that $Y_i \in \{-1, 1\}$ and that
each $h$ is such that $h(x) \in \{-1, 1\}$. We usually give equal weight to all
data points in the methods we have discussed. But one can incorporate unequal
weights quite easily in most algorithms. For example, in constructing a tree,
we could replace the impurity measure with a weighted impurity measure. The
original version of boosting, called AdaBoost, is as follows.

1. Set the weights $w_i = 1/n$, $i = 1,\ldots,n$.

2. For $j = 1,\ldots,J$, do the following steps:

(a) Construct a classifier $h_j$ from the data using the weights
$w_1,\ldots,w_n$.

(b) Compute the weighted error estimate:

$$
\hat L_j = \frac{\sum_{i=1}^n w_i\, I(Y_i \neq h_j(x_i))}{\sum_{i=1}^n w_i}.
$$

(c) Let $\alpha_j = \log((1 - \hat L_j)/\hat L_j)$.

(d) Update the weights:

$$
w_i \leftarrow w_i\, e^{\alpha_j I(Y_i \neq h_j(x_i))}.
$$

3. The final classifier is

$$
\hat h(x) = \mathrm{sign}\left(\sum_{j=1}^J \alpha_j h_j(x)\right).
$$
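The AdaBoost steps can be sketched with one-split trees (stumps) as the base classifiers. This is an illustrative sketch rather than a tuned implementation: the exhaustive stump search, the clipping of the weighted error away from 0 and 1 (added here only to avoid taking the log of zero), the number of rounds, and the simulated data with a diagonal boundary are all assumptions.

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted one-split tree: (feature, threshold, sign) with minimal weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(x.shape[1]):
        for t in np.unique(x[:, j]):
            for s in (1, -1):
                pred = np.where(x[:, j] <= t, -s, s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best[:3]

def stump_predict(stump, x):
    j, t, s = stump
    return np.where(x[:, j] <= t, -s, s)

def adaboost(x, y, J=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # step 1
    stumps, alphas = [], []
    for _ in range(J):                             # step 2
        stump = fit_stump(x, y, w)                 # (a)
        miss = stump_predict(stump, x) != y
        L = w[miss].sum() / w.sum()                # (b)
        L = np.clip(L, 1e-10, 1 - 1e-10)           # safeguard, not part of the algorithm
        alpha = np.log((1 - L) / L)                # (c)
        w = w * np.exp(alpha * miss)               # (d)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, x):
    total = sum(a * stump_predict(s, x) for s, a in zip(stumps, alphas))
    return np.sign(total)                          # step 3

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)         # diagonal boundary

single = fit_stump(x, y, np.full(200, 1.0 / 200))
stump_acc = (stump_predict(single, x) == y).mean()
stumps, alphas = adaboost(x, y)
boosted_acc = (adaboost_predict(stumps, alphas, x) == y).mean()
print(stump_acc, boosted_acc)
```

A single stump cannot follow a diagonal boundary, while the weighted vote of many stumps approximates it by a staircase, which is why the boosted training accuracy should be higher here.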


There is now an enormous literature trying to explain and improve on
boosting. Whereas bagging is a variance reduction technique, boosting can
be thought of as a bias reduction technique. We start with a simple (and hence
highly biased) classifier, and we gradually reduce the bias. The disadvantage
of boosting is that the final classifier is quite complicated.

Neural Networks are regression models of the form³

$$
Y = \beta_0 + \sum_{j=1}^p \beta_j\, \sigma(\alpha_0 + \alpha^T X)
$$

where $\sigma$ is a smooth function, often taken to be
$\sigma(v) = e^v/(1 + e^v)$. This is really nothing more than a nonlinear
regression model. Neural nets were fashionable for some time but they pose
great computational difficulties. In particular, one often encounters multiple
minima when trying to find the least squares estimates of the parameters. Also,
the number of terms $p$ is essentially a smoothing parameter and there is the
usual problem of trying to choose $p$ to find a good balance between bias and
variance.


³This is the simplest version of a neural net. There are more complex versions of the model.

22.12  Bibliographic  Remarks

The literature on classification is vast and is growing quickly. An excellent
reference is Hastie et al. (2001). For more on the theory, see Devroye et al.
(1996) and Vapnik (1998). Two recent books on kernels are Schölkopf and
Smola (2002) and Herbrich (2002).


22.13  Exercises 

1.  Prove  Theorem  22.5. 

2.  Prove  Theorem  22.7. 

3. Download the spam data from:


http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html


The data file can also be found on the course web page. The data contain 57
covariates relating to email messages. Each email message was classified as
spam (Y=1) or not spam (Y=0). The outcome Y is the last column in the file. The
goal is to predict whether an email is spam or not.

(a) Construct classification rules using (i) LDA, (ii) QDA, (iii) logistic
regression, and (iv) a classification tree. For each, report the observed
misclassification error rate and construct a 2-by-2 table of the form

             h(x) = 0    h(x) = 1
    Y = 0       ??          ??
    Y = 1       ??          ??

(b) Use 5-fold cross-validation to estimate the prediction accuracy of
LDA and logistic regression.

(c) Sometimes it helps to reduce the number of covariates. One strategy
is to compare $\bar X_i$ for the spam and email group. For each of the 57
covariates, test whether the mean of the covariate is the same or different
between the two groups. Keep the 10 covariates with the smallest p-values.
Try LDA and logistic regression using only these 10 variables.

4. Let $\mathcal{A}$ be the set of two-dimensional spheres. That is,
$A \in \mathcal{A}$ if $A = \{(x, y) :\ (x - a)^2 + (y - b)^2 \le c^2\}$ for
some $a, b, c$. Find the VC-dimension of $\mathcal{A}$.

5.  Classify  the  spam  data  using  support  vector  machines.  Free  software  for 
the  support  vector  machine  is  at  http://svmlight.joachims.org/ 

6.  Use  VC  theory  to  get  a  confidence  interval  on  the  true  error  rate  of  the 
LDA  classifier  for  the  iris  data  (from  the  book  web  site). 

7. Suppose that $X_i \in \mathbb{R}$ and that $Y_i = 1$ whenever
$|X_i| \le 1$ and $Y_i = 0$ whenever $|X_i| > 1$. Show that no linear classifier
can perfectly classify these data. Show that the kernelized data
$Z_i = (X_i, X_i^2)$ can be linearly separated.

8. Repeat question 5 using the kernel $K(x, \tilde x) = (1 + x^T\tilde x)^p$.
Choose $p$ by cross-validation.

9.  Apply  the  k  nearest  neighbors  classifier  to  the  “iris  data.”  Choose  k  by 
cross-validation. 

10. (Curse of Dimensionality.) Suppose that X has a uniform distribution
on the d-dimensional cube $[-1/2, 1/2]^d$. Let R be the distance from the
origin to the closest neighbor. Show that the median of R is

$$\left( \frac{1 - (1/2)^{1/n}}{v_d(1)} \right)^{1/d}$$

where

$$v_d(r) = r^d \, \frac{\pi^{d/2}}{\Gamma((d/2) + 1)}$$

is the volume of a sphere of radius r. For what dimension d does the
median of R exceed the edge of the cube when n = 100, n = 1,000,
n = 10,000? (Hastie et al. (2001), p. 22-27.)

11.  Fit  a  tree  to  the  data  in  question  3.  Now  apply  bagging  and  report  your 
results. 

12.  Fit  a  tree  that  uses  only  one  split  on  one  variable  to  the  data  in  question 
3.  Now  apply  boosting. 
13. Let $r(x) = \mathbb{P}(Y = 1 \mid X = x)$ and let $\hat{r}(x)$ be an estimate of $r(x)$. Consider
the classifier

$$h(x) = \begin{cases} 1 & \text{if } \hat{r}(x) \ge 1/2 \\ 0 & \text{otherwise.} \end{cases}$$

Assume that $\hat{r}(x) \approx N(\bar{r}(x), \sigma^2(x))$ for some functions $\bar{r}(x)$ and $\sigma^2(x)$.
Show that, for fixed x,

$$\mathbb{P}(Y \ne h(x)) \approx \mathbb{P}(Y \ne h^*(x)) + |2r(x) - 1| \times \left[ 1 - \Phi\left( \frac{\operatorname{sign}(r(x) - 1/2)\,(\bar{r}(x) - 1/2)}{\sigma(x)} \right) \right]$$

where $\Phi$ is the standard Normal CDF and $h^*$ is the Bayes rule. Regard
$\operatorname{sign}(r(x) - 1/2)\,(\bar{r}(x) - 1/2)$ as a type of bias term. Explain the
implications for the bias-variance tradeoff in classification (Friedman
(1997)).

Hint: first show that

$$\mathbb{P}(Y \ne h(x)) = |2r(x) - 1| \, \mathbb{P}(h(x) \ne h^*(x)) + \mathbb{P}(Y \ne h^*(x)).$$
23.1 Introduction

Most of this book has focused on iid sequences of random variables. Now we
consider sequences of dependent random variables. For example, daily
temperatures will form a sequence of time-ordered random variables and clearly
the temperature on one day is not independent of the temperature on the
previous day.

A stochastic process $\{X_t : t \in T\}$ is a collection of random variables.
We shall sometimes write $X(t)$ instead of $X_t$. The variables $X_t$ take values in
some set $\mathcal{X}$ called the state space. The set $T$ is called the index set and
for our purposes can be thought of as time. The index set can be discrete,
$T = \{0, 1, 2, \ldots\}$, or continuous, $T = [0, \infty)$, depending on the application.

23.1 Example (iid observations). A sequence of iid random variables can be
written as $\{X_t : t \in T\}$ where $T = \{1, 2, 3, \ldots\}$. Thus, a sequence of iid
random variables is an example of a stochastic process. ■

23.2 Example (The Weather). Let $\mathcal{X}$ = {sunny, cloudy}. A typical sequence
(depending on where you live) might be

sunny, sunny, cloudy, sunny, cloudy, cloudy, ...

This process has a discrete state space and a discrete index set. ■

FIGURE 23.1. Stock price over ten week period. (Horizontal axis: time.)


23.3  Example  (Stock  Prices).  Figure  23.1  shows  the  price  of  a  fictitious  stock 
over  time.  The  price  is  monitored  continuously  so  the  index  set  T  is  continuous. 
Price  is  discrete  but  for  all  practical  purposes  we  can  treat  it  as  a  continuous 
variable.  ■ 


23.4 Example (Empirical Distribution Function). Let $X_1, \ldots, X_n \sim F$ where
$F$ is some CDF on [0, 1]. Let

$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n I(X_i \le t)$$

be the empirical CDF. For any fixed value $t$, $\hat{F}_n(t)$ is a random variable. But
the whole empirical CDF

$$\{\hat{F}_n(t) : t \in [0, 1]\}$$

is a stochastic process with a continuous state space and a continuous index
set. ■

We end this section by recalling a basic fact. If $X_1, \ldots, X_n$ are random
variables, then we can write the joint density as

$$f(x_1, \ldots, x_n) = f(x_1) f(x_2 \mid x_1) \cdots f(x_n \mid x_1, \ldots, x_{n-1}) = \prod_{i=1}^n f(x_i \mid \text{past}_i) \qquad (23.1)$$

where $\text{past}_i = (X_1, \ldots, X_{i-1})$.

23.2  Markov  Chains 

A Markov chain is a stochastic process for which the distribution of $X_t$
depends only on $X_{t-1}$. In this section we assume that the state space is
discrete, either $\mathcal{X} = \{1, \ldots, N\}$ or $\mathcal{X} = \{1, 2, \ldots\}$, and that the index set is
$T = \{0, 1, 2, \ldots\}$. Most authors write $X_n$ instead of $X_t$ when
discussing Markov chains and I will do so as well.

23.5 Definition. The process $\{X_n : n \in T\}$ is a Markov chain if

$$\mathbb{P}(X_n = x \mid X_0, \ldots, X_{n-1}) = \mathbb{P}(X_n = x \mid X_{n-1}) \qquad (23.2)$$

for all $n$ and for all $x \in \mathcal{X}$.

For a Markov chain, equation (23.1) simplifies to

$$f(x_1, \ldots, x_n) = f(x_1) f(x_2 \mid x_1) f(x_3 \mid x_2) \cdots f(x_n \mid x_{n-1}).$$

A Markov chain can be represented by the following DAG:

$$X_0 \to X_1 \to X_2 \to \cdots \to X_n \to \cdots$$

Each variable has a single parent, namely, the previous observation.

The theory of Markov chains is very rich and complex. We have to get
through many definitions before we can do anything interesting. Our goal is
to answer the following questions:

1.  When  does  a  Markov  chain  “settle  down”  into  some  sort  of  equilibrium? 

2.  How  do  we  estimate  the  parameters  of  a  Markov  chain? 

3. How can we construct Markov chains that converge to a given equilibrium
distribution and why would we want to do that?

We  will  answer  questions  1  and  2  in  this  chapter.  We  will  answer  question 
3  in  the  next  chapter.  To  understand  question  1,  look  at  the  two  chains  in 
Figure  23.2.  The  first  chain  oscillates  all  over  the  place  and  will  continue  to 
do  so  forever.  The  second  chain  eventually  settles  into  an  equilibrium.  If  we 
constructed  a  histogram  of  the  first  process,  it  would  keep  changing  as  we  got 
more and more observations. But a histogram from the second chain would
eventually converge to some fixed distribution.


FIGURE  23.2.  Two  Markov  chains.  The  first  chain  does  not  settle  down  into  an 
equilibrium.  The  second  does. 


Transition Probabilities. The key quantities of a Markov chain are the
probabilities of jumping from one state into another state. A Markov chain is
homogeneous if $\mathbb{P}(X_{n+1} = j \mid X_n = i)$ does not change with time. Thus, for
a homogeneous Markov chain, $\mathbb{P}(X_{n+1} = j \mid X_n = i) = \mathbb{P}(X_1 = j \mid X_0 = i)$. We
shall only deal with homogeneous Markov chains.


23.6 Definition. We call

$$p_{ij} \equiv \mathbb{P}(X_{n+1} = j \mid X_n = i) \qquad (23.3)$$

the transition probabilities. The matrix $P$ whose $(i, j)$ element is $p_{ij}$
is called the transition matrix.


Notice that $P$ has two properties: (i) $p_{ij} \ge 0$ and (ii) $\sum_j p_{ij} = 1$.
Each row can be regarded as a probability mass function.

23.7 Example (Random Walk With Absorbing Barriers). Let $\mathcal{X} = \{1, \ldots, N\}$.
Suppose you are standing at one of these points. Flip a coin with P(Heads) = p
and P(Tails) = q = 1 - p. If it is heads, take one step to the right. If it is
tails, take one step to the left. If you hit one of the endpoints, stay there. The
transition matrix is

$$P = \begin{bmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 \\
\vdots & & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & q & 0 & p \\
0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{bmatrix}$$

23.8 Example. Suppose the state space is $\mathcal{X}$ = {sunny, cloudy}. Then $X_1,
X_2, \ldots$ represents the weather for a sequence of days. The weather today
clearly depends on yesterday's weather. It might also depend on the weather
two days ago but as a first approximation we might assume that the
dependence is only one day back. In that case the weather is a Markov chain and a
typical transition matrix might be

          Sunny   Cloudy
Sunny      0.4     0.6
Cloudy     0.8     0.2

For  example,  if  it  is  sunny  today,  there  is  a  60  per  cent  chance  it  will  be  cloudy 
tomorrow.  ■ 


Let

$$p_{ij}(n) = \mathbb{P}(X_{m+n} = j \mid X_m = i) \qquad (23.4)$$

be the probability of going from state $i$ to state $j$ in $n$ steps. Let $P_n$ be the
matrix whose $(i, j)$ element is $p_{ij}(n)$. These are called the n-step transition
probabilities.

23.9 Theorem (The Chapman-Kolmogorov equations). The n-step probabilities
satisfy

$$p_{ij}(m + n) = \sum_k p_{ik}(m) \, p_{kj}(n). \qquad (23.5)$$

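Proof. Recall that, in general,

$$\mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x) \, \mathbb{P}(Y = y \mid X = x).$$

This fact is true in the more general form

$$\mathbb{P}(X = x, Y = y \mid Z = z) = \mathbb{P}(X = x \mid Z = z) \, \mathbb{P}(Y = y \mid X = x, Z = z).$$

Also, recall the law of total probability:

$$\mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y).$$

Using these facts and the Markov property we have

$$\begin{aligned}
p_{ij}(m + n) &= \mathbb{P}(X_{m+n} = j \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j, X_m = k \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j \mid X_m = k, X_0 = i) \, \mathbb{P}(X_m = k \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j \mid X_m = k) \, \mathbb{P}(X_m = k \mid X_0 = i) \\
&= \sum_k p_{ik}(m) \, p_{kj}(n). \quad ■
\end{aligned}$$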


Look closely at equation (23.5). This is nothing more than the equation for
matrix multiplication. Hence we have shown that

$$P_{m+n} = P_m P_n. \qquad (23.6)$$

By definition, $P_1 = P$. Using the above theorem, $P_2 = P_{1+1} = P_1 P_1 =
PP = P^2$. Continuing this way, we see that

$$P_n = P^n \equiv \underbrace{P \times P \times \cdots \times P}_{\text{multiply the matrix } n \text{ times}}. \qquad (23.7)$$

Let $\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ be a row vector where

$$\mu_n(i) = \mathbb{P}(X_n = i) \qquad (23.8)$$

is the marginal probability that the chain is in state $i$ at time $n$. In particular,
$\mu_0$ is called the initial distribution. To simulate a Markov chain, all you
need to know is $\mu_0$ and $P$. The simulation would look like this:


Step 1: Draw $X_0 \sim \mu_0$. Thus, $\mathbb{P}(X_0 = i) = \mu_0(i)$.

Step 2: Denote the outcome of step 1 by $i$. Draw $X_1 \sim P$. In other words,
$\mathbb{P}(X_1 = j \mid X_0 = i) = p_{ij}$.

Step 3: Suppose the outcome of step 2 is $j$. Draw $X_2 \sim P$. In other words,
$\mathbb{P}(X_2 = k \mid X_1 = j) = p_{jk}$.

And so on.
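The simulation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: states are coded as integers 0, ..., N-1, the function name `simulate_chain` is hypothetical, and the initial distribution used for the weather chain is made up (the text does not give one).

```python
import random

def simulate_chain(mu0, P, n_steps, rng):
    """Simulate X_0, ..., X_{n_steps} for a finite chain with states 0..N-1."""
    states = list(range(len(mu0)))
    # Step 1: draw X_0 from the initial distribution mu0.
    x = rng.choices(states, weights=mu0)[0]
    path = [x]
    for _ in range(n_steps):
        # Steps 2, 3, ...: draw the next state from row x of P.
        x = rng.choices(states, weights=P[x])[0]
        path.append(x)
    return path

# Weather chain of Example 23.8: state 0 = sunny, state 1 = cloudy.
P = [[0.4, 0.6],
     [0.8, 0.2]]
mu0 = [0.5, 0.5]  # assumed initial distribution, not taken from the text

path = simulate_chain(mu0, P, n_steps=20, rng=random.Random(0))
print(path)
```

Passing an explicit `random.Random` instance keeps the run reproducible without touching the global random state.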


It might be difficult to understand the meaning of $\mu_n$. Imagine simulating
the chain many times. Collect all the outcomes at time $n$ from all the chains.
A histogram of these outcomes would look approximately like $\mu_n$. A consequence
of Theorem 23.9 is the following:


23.10 Lemma. The marginal probabilities are given by

$$\mu_n = \mu_0 P^n.$$

Proof.

$$\begin{aligned}
\mu_n(j) &= \mathbb{P}(X_n = j) \\
&= \sum_i \mathbb{P}(X_n = j \mid X_0 = i) \, \mathbb{P}(X_0 = i) \\
&= \sum_i \mu_0(i) \, p_{ij}(n),
\end{aligned}$$

which is the $j$th element of $\mu_0 P^n$. ■
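As a numerical check of Theorem 23.9 and Lemma 23.10, the following sketch computes matrix powers for the weather chain of Example 23.8 with plain Python lists. The helper names (`mat_mul`, `mat_pow`, `marginal`) and the choice of starting on a sunny day are assumptions made for illustration.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_pow(P, n):
    """Return P^n by repeated multiplication (P^0 is the identity)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

def marginal(mu0, P, n):
    """Return mu_n = mu_0 P^n as a list (Lemma 23.10)."""
    return mat_mul([mu0], mat_pow(P, n))[0]

P = [[0.4, 0.6], [0.8, 0.2]]   # weather chain, Example 23.8
mu0 = [1.0, 0.0]               # assumption: the chain starts on a sunny day

# Chapman-Kolmogorov in matrix form: P_{m+n} = P_m P_n, here with m = 2, n = 3.
lhs = mat_pow(P, 5)
rhs = mat_mul(mat_pow(P, 2), mat_pow(P, 3))

mu2 = marginal(mu0, P, 2)      # approximately (0.64, 0.36)
print(lhs, rhs, mu2)
```

Up to floating-point rounding, `lhs` and `rhs` agree entry by entry.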

Summary  of  Terminology 

1. Transition matrix: $P(i, j) = \mathbb{P}(X_{n+1} = j \mid X_n = i) = p_{ij}$.

2. n-step matrix: $P_n(i, j) = \mathbb{P}(X_{n+m} = j \mid X_m = i)$.

3. $P_n = P^n$.

4. Marginal: $\mu_n(i) = \mathbb{P}(X_n = i)$.

5. $\mu_n = \mu_0 P^n$.

States.  The  states  of  a  Markov  chain  can  be  classified  according  to  various 
properties. 

23.11 Definition. We say that $i$ reaches $j$ (or $j$ is accessible from $i$) if
$p_{ij}(n) > 0$ for some $n$, and we write $i \to j$. If $i \to j$ and $j \to i$ then we
write $i \leftrightarrow j$ and we say that $i$ and $j$ communicate.

23.12 Theorem. The communication relation satisfies the following properties:

1. $i \leftrightarrow i$.

2. If $i \leftrightarrow j$ then $j \leftrightarrow i$.

3. If $i \leftrightarrow j$ and $j \leftrightarrow k$ then $i \leftrightarrow k$.

4. The set of states $\mathcal{X}$ can be written as a disjoint union of classes
$\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots$ where two states $i$ and $j$ communicate with each other if
and only if they are in the same class.

If all states communicate with each other, then the chain is called
irreducible. A set of states is closed if, once you enter that set of states, you
never leave. A closed set consisting of a single state is called an absorbing
state.

23.13 Example. Let $\mathcal{X} = \{1, 2, 3, 4\}$ and

$$P = \begin{bmatrix}
1/3 & 2/3 & 0 & 0 \\
2/3 & 1/3 & 0 & 0 \\
1/4 & 1/4 & 1/4 & 1/4 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

The  classes  are  {1,  2},  {3}  and  {4}.  State  4  is  an  absorbing  state.  ■ 
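The classes in an example like this can be found mechanically from the pattern of zeros in $P$. The sketch below is one way to do it, using a transitive closure (Warshall's algorithm); the function names are hypothetical, and the numerical entries of the matrix are an assumption, since only which entries are positive affects the result.

```python
def reachable(P):
    """reach[i][j] is True when p_ij(n) > 0 for some n >= 0."""
    n = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    # Warshall's algorithm: allow paths through intermediate state k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach

def communicating_classes(P):
    """Group states (reported as 1..N) into communicating classes."""
    reach = reachable(P)
    n = len(P)
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        seen.update(cls)
        classes.append([j + 1 for j in cls])
    return classes

# Matrix with the zero pattern of Example 23.13 (entry values assumed).
P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]

classes = communicating_classes(P)
absorbing = [i + 1 for i in range(len(P)) if P[i][i] == 1]
print(classes, absorbing)
```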

Suppose  we  start  a  chain  in  state  i.  Will  the  chain  ever  return  to  state  i? 
If  so,  that  state  is  called  persistent  or  recurrent. 

23.14 Definition. State $i$ is recurrent or persistent if

$$\mathbb{P}(X_n = i \text{ for some } n \ge 1 \mid X_0 = i) = 1.$$

Otherwise, state $i$ is transient.

23.15 Theorem. A state $i$ is recurrent if and only if

$$\sum_n p_{ii}(n) = \infty. \qquad (23.9)$$

A state $i$ is transient if and only if

$$\sum_n p_{ii}(n) < \infty. \qquad (23.10)$$

Proof. Define

$$I_n = \begin{cases} 1 & \text{if } X_n = i \\ 0 & \text{if } X_n \ne i. \end{cases}$$

The number of times that the chain is in state $i$ is $Y = \sum_{n=0}^{\infty} I_n$. The mean
of $Y$, given that the chain starts in state $i$, is

$$\mathbb{E}(Y \mid X_0 = i) = \sum_{n=0}^{\infty} \mathbb{E}(I_n \mid X_0 = i) = \sum_{n=0}^{\infty} \mathbb{P}(X_n = i \mid X_0 = i) = \sum_{n=0}^{\infty} p_{ii}(n).$$

Define $a_i = \mathbb{P}(X_n = i \text{ for some } n \ge 1 \mid X_0 = i)$. If $i$ is recurrent, $a_i = 1$. Thus,
the chain will eventually return to $i$. Once it does return to $i$, we argue again
that since $a_i = 1$, the chain will return to state $i$ again. By repeating this
argument, we conclude that $\mathbb{E}(Y \mid X_0 = i) = \infty$. If $i$ is transient, then $a_i < 1$.
When the chain is in state $i$, there is a probability $1 - a_i > 0$ that it will never
return to state $i$. Thus, the probability that the chain is in state $i$ exactly $n$
times is $a_i^{n-1}(1 - a_i)$. This is a geometric distribution which has finite mean. ■


23.16  Theorem.  Facts  about  recurrence. 


1. If state $i$ is recurrent and $i \leftrightarrow j$, then $j$ is recurrent.

2. If state $i$ is transient and $i \leftrightarrow j$, then $j$ is transient.

3. A finite Markov chain must have at least one recurrent state.

4. The states of a finite, irreducible Markov chain are all recurrent.

23.17 Theorem (Decomposition Theorem). The state space $\mathcal{X}$ can be written
as the disjoint union

$$\mathcal{X} = \mathcal{X}_T \cup \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots$$

where $\mathcal{X}_T$ are the transient states and each $\mathcal{X}_i$ is a closed, irreducible set of
recurrent states.


23.18 Example (Random Walk). Let $\mathcal{X} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and
suppose that $p_{i,i+1} = p$, $p_{i,i-1} = q = 1 - p$. All states communicate, hence either
all the states are recurrent or all are transient. To see which, suppose we start
at $X_0 = 0$. Note that

$$p_{00}(2n) = \binom{2n}{n} p^n q^n \qquad (23.11)$$

since the only way to get back to 0 is to have $n$ heads (steps to the right) and
$n$ tails (steps to the left). We can approximate this expression using Stirling's
formula which says that

$$n! \sim n^n \sqrt{n} \, e^{-n} \sqrt{2\pi}.$$

Inserting this approximation into (23.11) shows that

$$p_{00}(2n) \sim \frac{(4pq)^n}{\sqrt{n\pi}}.$$

It is easy to check that $\sum_n p_{00}(n) < \infty$ if and only if $\sum_n p_{00}(2n) < \infty$.
Moreover, $\sum_n p_{00}(2n) = \infty$ if and only if $p = q = 1/2$. By Theorem 23.15,
the chain is recurrent if $p = 1/2$; otherwise it is transient. ■
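To get a feel for the recurrence criterion in Example 23.18, one can compare the exact return probability (23.11) with its Stirling approximation and watch the partial sums of $p_{00}(2n)$. The sketch below does this for $p = 0.5$ and $p = 0.6$; the function names and the cutoffs are illustrative choices, not part of the text.

```python
import math

def p00_exact(n, p):
    """Exact p_00(2n) = C(2n, n) p^n q^n, equation (23.11)."""
    return math.comb(2 * n, n) * (p * (1 - p)) ** n

def p00_stirling(n, p):
    """Stirling approximation (4pq)^n / sqrt(n * pi)."""
    return (4 * p * (1 - p)) ** n / math.sqrt(n * math.pi)

def partial_sum(N, p):
    """Sum of p_00(2n) for n = 1, ..., N."""
    return sum(p00_exact(n, p) for n in range(1, N + 1))

# The approximation is already close for moderate n.
ratio = p00_exact(100, 0.5) / p00_stirling(100, 0.5)

# Symmetric walk: partial sums keep growing (roughly like sqrt(N)).
sym_100, sym_400 = partial_sum(100, 0.5), partial_sum(400, 0.5)

# Asymmetric walk: partial sums level off, so the chain is transient.
asym_100, asym_400 = partial_sum(100, 0.6), partial_sum(400, 0.6)

print(ratio, sym_100, sym_400, asym_100, asym_400)
```

For $p = 0.6$ the partial sums settle near 4, which matches the closed form $\sum_{n \ge 1} \binom{2n}{n} x^n = (1 - 4x)^{-1/2} - 1$ evaluated at $x = pq = 0.24$.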
Convergence of Markov Chains. To discuss the convergence of chains,
we need a few more definitions. Suppose that $X_0 = i$. Define the recurrence
time

$$T_{ij} = \min\{n > 0 : X_n = j\} \qquad (23.12)$$

assuming $X_n$ ever reaches state $j$; otherwise define $T_{ij} = \infty$. The mean
recurrence time of a recurrent state $i$ is

$$m_i = \mathbb{E}(T_{ii}) = \sum_n n f_{ii}(n) \qquad (23.13)$$

where

$$f_{ij}(n) = \mathbb{P}(X_1 \ne j, X_2 \ne j, \ldots, X_{n-1} \ne j, X_n = j \mid X_0 = i).$$

A recurrent state is null if $m_i = \infty$; otherwise it is called non-null or
positive.


23.19 Lemma. If a state is null and recurrent, then $p_{ii}(n) \to 0$ as $n \to \infty$.

23.20 Lemma. In a finite state Markov chain, all recurrent states are positive.

Consider a three-state chain with transition matrix

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.$$

Suppose we start the chain in state 1 at time 0. Then we will be back in state 1
at times 3, 6, 9, .... This is an example of a periodic chain. Formally, the period
of state $i$ is $d$ if $p_{ii}(n) = 0$ whenever $n$ is not divisible by $d$ and $d$ is the largest integer
with this property. Thus, $d = \gcd\{n : p_{ii}(n) > 0\}$ where gcd means "greatest
common divisor." State $i$ is periodic if $d(i) > 1$ and aperiodic if $d(i) = 1$.
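The definition of the period as a gcd translates directly into a small computation. The sketch below approximates $d(i)$ by taking the gcd over $n \le$ `max_n` only; that truncation and the helper names are assumptions of this illustration, although for small finite chains a modest cutoff is normally enough.

```python
from math import gcd

def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def period(P, i, max_n=50):
    """gcd of {n <= max_n : p_ii(n) > 0}; returns 0 if no return is seen."""
    d = 0
    Pn = P                      # Pn holds P^n, starting with n = 1
    for n in range(1, max_n + 1):
        if Pn[i][i] > 0:
            d = gcd(d, n)       # gcd(0, n) == n, so the first hit sets d = n
        Pn = mat_mul(Pn, P)
    return d

cyclic = [[0, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]            # the three-state chain above
weather = [[0.4, 0.6],
           [0.8, 0.2]]          # Example 23.8

print(period(cyclic, 0), period(weather, 0))
```

Every state of the cyclic chain has period 3, while the weather chain is aperiodic because $p_{ii}(1) > 0$.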

23.21 Lemma. If state $i$ has period $d$ and $i \leftrightarrow j$ then $j$ has period $d$.


23.22 Definition. A state is ergodic if it is recurrent, non-null and
aperiodic. A chain is ergodic if all its states are ergodic.

Let $\pi = (\pi_i : i \in \mathcal{X})$ be a vector of non-negative numbers that sum to one.
Thus $\pi$ can be thought of as a probability mass function.

23.23 Definition. We say that $\pi$ is a stationary (or invariant)
distribution if $\pi = \pi P$.
Here is the intuition. Draw $X_0$ from distribution $\pi$ and suppose that $\pi$ is a
stationary distribution. Now draw $X_1$ according to the transition probability
of the chain. The distribution of $X_1$ is then $\mu_1 = \mu_0 P = \pi P = \pi$. The
distribution of $X_2$ is $\pi P^2 = (\pi P) P = \pi P = \pi$. Continuing this way, we see
that the distribution of $X_n$ is $\pi P^n = \pi$. In other words:

If at any time the chain has distribution $\pi$, then it will continue to
have distribution $\pi$ forever.


23.24 Definition. We say that a chain has limiting distribution if

$$P^n \to \begin{bmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{bmatrix}$$

for some $\pi$, that is, $\pi_j = \lim_{n \to \infty} P^n_{ij}$
exists and is independent of $i$.

Here  is  the  main  theorem  about  convergence.  The  theorem  says  that  an 
ergodic  chain  converges  to  its  stationary  distribution.  Also,  sample  averages 
converge  to  their  theoretical  expectations  under  the  stationary  distribution. 

23.25 Theorem. An irreducible, ergodic Markov chain has a unique
stationary distribution $\pi$. The limiting distribution exists and is equal to
$\pi$. If $g$ is any bounded function, then, with probability 1,

$$\frac{1}{N} \sum_{n=1}^{N} g(X_n) \to \mathbb{E}_\pi(g) \equiv \sum_j g(j) \pi_j \quad \text{as } N \to \infty. \qquad (23.14)$$
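Both parts of Theorem 23.25 can be illustrated on the weather chain of Example 23.8, whose stationary distribution works out to $\pi = (4/7, 3/7)$ by solving $\pi = \pi P$. The sketch below iterates $\mu \mapsto \mu P$ from an arbitrary start and also computes a time average along one simulated path; the function names, the number of iterations, and the choice of $g$ as the indicator of "sunny" are assumptions for illustration.

```python
import random

def step(mu, P):
    """One step of the marginal recursion: mu -> mu P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(P, n_iter=200):
    """Approximate the limiting distribution by iterating mu -> mu P."""
    mu = [1.0] + [0.0] * (len(P) - 1)   # arbitrary starting distribution
    for _ in range(n_iter):
        mu = step(mu, P)
    return mu

def time_average(g, P, x0, n_steps, rng):
    """Average of g(X_1), ..., g(X_n) along one simulated path."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        x = rng.choices(range(len(P)), weights=P[x])[0]
        total += g(x)
    return total / n_steps

P = [[0.4, 0.6], [0.8, 0.2]]            # state 0 = sunny, state 1 = cloudy
pi = stationary_by_iteration(P)         # close to (4/7, 3/7)

# g is the indicator of "sunny", so E_pi(g) = pi_0 = 4/7.
avg = time_average(lambda x: 1.0 if x == 0 else 0.0, P, 0, 20000, random.Random(0))
print(pi, avg)
```

Iterating the recursion is the simplest approach for a small chain; for larger chains one would more commonly solve the linear system $\pi = \pi P$, $\sum_i \pi_i = 1$ directly.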


Finally, there is another definition that will be useful later. We say that $\pi$
satisfies detailed balance if

$$\pi_i p_{ij} = p_{ji} \pi_j. \qquad (23.15)$$

Detailed balance guarantees that $\pi$ is a stationary distribution.

23.26 Theorem. If $\pi$ satisfies detailed balance, then $\pi$ is a stationary
distribution.

Proof. We need to show that $\pi P = \pi$. The $j$th element of $\pi P$ is
$\sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j$. ■

The importance of detailed balance will become clear when we discuss
Markov chain Monte Carlo methods in Chapter 24.

Warning!  Just  because  a  chain  has  a  stationary  distribution  does  not  mean 
it  converges. 


23.27 Example. Let

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.$$

Let $\pi = (1/3, 1/3, 1/3)$. Then $\pi P = \pi$ so $\pi$ is a stationary distribution. If
the chain is started with the distribution $\pi$ it will stay in that distribution.
Imagine simulating many chains and checking the marginal distribution at
each time $n$. It will always be the uniform distribution $\pi$. But this chain does
not have a limit. It continues to cycle around forever. ■


Examples  of  Markov  Chains. 


23.28 Example. Let $\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$. Let

$$P = \begin{bmatrix}
1/2 & 1/2 & 0 & 0 & 0 & 0 \\
1/4 & 3/4 & 0 & 0 & 0 & 0 \\
1/4 & 1/4 & 1/4 & 1/4 & 0 & 0 \\
1/4 & 0 & 1/4 & 1/4 & 0 & 1/4 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 0 & 1/2 & 1/2
\end{bmatrix}.$$

Then $C_1 = \{1, 2\}$ and $C_2 = \{5, 6\}$ are irreducible closed sets. States 3 and
4 are transient because of the path $3 \to 4 \to 6$ and once you hit state 6
you cannot return to 3 or 4. Since $p_{ii}(1) > 0$, all the states are aperiodic. In
summary, 3 and 4 are transient while 1, 2, 5, and 6 are ergodic. ■


23.29 Example (Hardy-Weinberg). Here is a famous example from genetics.
Suppose a gene can be type A or type a. There are three types of people (called
genotypes): AA, Aa, and aa. Let $(p, q, r)$ denote the fraction of people of each
genotype. We assume that everyone contributes one of their two copies of the
gene at random to their children. We also assume that mates are selected at
random. The latter is not realistic; however, it is often reasonable to assume
that you do not choose your mate based on whether they are AA, Aa, or
aa. (This would be false if the gene was for eye color and if people chose
mates based on eye color.) Imagine if we pooled everyone's genes together.
The proportion of A genes is $P = p + (q/2)$ and the proportion of a genes is
$Q = r + (q/2)$. A child is AA with probability $P^2$, Aa with probability $2PQ$,
and aa with probability $Q^2$. Thus, the fraction of A genes in this generation
is

$$P^2 + PQ = \left(p + \frac{q}{2}\right)^2 + \left(p + \frac{q}{2}\right)\left(r + \frac{q}{2}\right).$$

However, $r = 1 - p - q$. Substitute this in the above equation and you get
$P^2 + PQ = P$. A similar calculation shows that the fraction of "a" genes is
$Q$. We have shown that the proportion of type A and type a is $P$ and $Q$ and
this remains stable after the first generation. The proportion of people of type
AA, Aa, aa is thus $(P^2, 2PQ, Q^2)$ from the second generation and on. This is
called the Hardy-Weinberg law.

Assume everyone has exactly one child. Now consider a fixed person and
let $X_n$ be the genotype of their $n$th descendant. This is a Markov chain with
state space $\mathcal{X} = \{AA, Aa, aa\}$. Some basic calculations will show you that the
transition matrix is

$$\begin{bmatrix}
P & Q & 0 \\
\frac{P}{2} & \frac{P + Q}{2} & \frac{Q}{2} \\
0 & P & Q
\end{bmatrix}.$$

The stationary distribution is $\pi = (P^2, 2PQ, Q^2)$. ■
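The stationarity claim in Example 23.29 is easy to check numerically. In the sketch below the allele proportion 0.3 is an arbitrary value chosen for illustration, and the function name is hypothetical.

```python
def hardy_weinberg_matrix(P):
    """Transition matrix on (AA, Aa, aa) for A-allele proportion P."""
    Q = 1.0 - P
    return [[P,     Q,           0.0],
            [P / 2, (P + Q) / 2, Q / 2],
            [0.0,   P,           Q]]

P_allele = 0.3                      # assumed proportion of A genes
Q_allele = 1.0 - P_allele
M = hardy_weinberg_matrix(P_allele)

# Candidate stationary distribution (P^2, 2PQ, Q^2).
pi = [P_allele ** 2, 2 * P_allele * Q_allele, Q_allele ** 2]

# Row vector times matrix: the j-th entry is sum_i pi_i * M[i][j].
pi_next = [sum(pi[i] * M[i][j] for i in range(3)) for j in range(3)]
print(pi, pi_next)
```

The two printed vectors agree up to rounding, which is the statement $\pi = \pi P$ for this chain.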


23.30 Example (Markov chain Monte Carlo). In Chapter 24 we will present a
simulation method called Markov chain Monte Carlo (MCMC). Here is a brief
description of the idea. Let $f(x)$ be a probability density on the real line and
suppose that $f(x) = c\,g(x)$ where $g(x)$ is a known function and $c > 0$ is
unknown. In principle, we can compute $c$ since $\int f(x)\,dx = 1$ implies that
$c = 1 / \int g(x)\,dx$. However, it may not be feasible to perform this integral, nor
is it necessary to know $c$ in the following algorithm. Let $X_0$ be an arbitrary
starting value. Given $X_0, \ldots, X_i$, draw $X_{i+1}$ as follows. First, draw
$W \sim N(X_i, b^2)$ where $b > 0$ is some fixed constant. Let

$$r = \min\left\{ \frac{g(W)}{g(X_i)}, 1 \right\}.$$

Draw $U \sim \text{Uniform}(0, 1)$ and set

$$X_{i+1} = \begin{cases} W & \text{if } U < r \\ X_i & \text{if } U \ge r. \end{cases}$$

We will see in Chapter 24 that, under weak conditions, $X_0, X_1, \ldots$ is an
ergodic Markov chain with stationary distribution $f$. Hence, we can regard
the draws as a sample from $f$. ■
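The algorithm of Example 23.30 can be sketched as follows. The target used here, $g(x) = e^{-x^2/2}$ (a standard Normal density without its normalizing constant), the proposal scale $b = 1$, the number of draws, and the function names are all assumptions chosen so that the output is easy to check; they do not come from the text.

```python
import math
import random

def g(x):
    """Unnormalized target: proportional to the N(0, 1) density (assumed example)."""
    return math.exp(-0.5 * x * x)

def metropolis(g, x0, b, n_draws, rng):
    """Random-walk sampler following the steps of Example 23.30."""
    x = x0
    draws = [x]
    for _ in range(n_draws):
        w = rng.gauss(x, b)            # proposal W ~ N(X_i, b^2)
        r = min(g(w) / g(x), 1.0)      # acceptance probability
        u = rng.random()               # U ~ Uniform(0, 1)
        if u < r:
            x = w                      # accept: X_{i+1} = W
        draws.append(x)                # otherwise X_{i+1} = X_i
    return draws

draws = metropolis(g, x0=0.0, b=1.0, n_draws=50000, rng=random.Random(0))
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)
```

With this target the sample mean and variance should land near 0 and 1. Note that only the ratio $g(W)/g(X_i)$ is used, so the constant $c$ never has to be computed.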
Inference for Markov Chains. Consider a chain with finite state space
$\mathcal{X} = \{1, 2, \ldots, N\}$. Suppose we observe $n$ observations $X_1, \ldots, X_n$ from this
chain. The unknown parameters of a Markov chain are the initial probabilities
$\mu_0 = (\mu_0(1), \mu_0(2), \ldots)$ and the elements of the transition matrix $P$. Each
row of $P$ is a multinomial distribution. So we are essentially estimating $N$
distributions (plus the initial probabilities). Let $n_{ij}$ be the observed number
of transitions from state $i$ to state $j$. The likelihood function is

$$\mathcal{L}(\mu_0, P) = \mu_0(x_0) \prod_{r=1}^{n} p_{X_{r-1}, X_r} = \mu_0(x_0) \prod_{i=1}^{N} \prod_{j=1}^{N} p_{ij}^{n_{ij}}.$$

There is only one observation on $\mu_0$ so we can't estimate that. Rather, we
focus on estimating $P$. The mle is obtained by maximizing $\mathcal{L}(\mu_0, P)$ subject
to the constraint that the elements are non-negative and the rows sum to 1.
The solution is

$$\hat{p}_{ij} = \frac{n_{ij}}{n_i}$$

where $n_i = \sum_{j=1}^{N} n_{ij}$. Here we are assuming that $n_i > 0$. If not, then we set
$\hat{p}_{ij} = 0$ by convention.
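The mle amounts to counting transitions and normalizing each row. A minimal sketch, with states coded 0, ..., N-1 and a short made-up path used only to show the bookkeeping:

```python
def transition_mle(path, n_states):
    """Return (counts, P_hat) where counts[i][j] = n_ij and P_hat[i][j] = n_ij / n_i."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1              # one observed transition a -> b
    P_hat = []
    for row in counts:
        n_i = sum(row)
        # Convention from the text: set the estimate to 0 when n_i = 0.
        P_hat.append([c / n_i if n_i > 0 else 0.0 for c in row])
    return counts, P_hat

path = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # hypothetical observed sequence
counts, P_hat = transition_mle(path, n_states=2)
print(counts)   # [[2, 3], [3, 1]]
print(P_hat)    # [[0.4, 0.6], [0.75, 0.25]]
```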


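23.31 Theorem (Consistency and Asymptotic Normality of the mle). Assume that
the chain is ergodic. Let $\hat{p}_{ij}(n)$ denote the mle after $n$ observations. Then
$\hat{p}_{ij}(n) \xrightarrow{P} p_{ij}$. Also,

$$\left[ \sqrt{N_i(n)} \, (\hat{p}_{ij} - p_{ij}) \right] \rightsquigarrow N(0, \Sigma)$$

where the left-hand side is a matrix, $N_i(n) = \sum_{r=1}^{n} I(X_r = i)$ and

$$\Sigma_{ij,k\ell} = \begin{cases}
p_{ij}(1 - p_{ij}) & (i, j) = (k, \ell) \\
-p_{ij} p_{i\ell} & i = k, \; j \ne \ell \\
0 & \text{otherwise.}
\end{cases}$$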


23.3  Poisson  Processes 

The  Poisson  process  arises  when  we  count  occurrences  of  events  over  time,  for 
example,  traffic  accidents,  radioactive  decay,  arrival  of  email  messages,  etc. 
As  the  name  suggests,  the  Poisson  process  is  intimately  related  to  the  Poisson 
distribution.  Let’s  first  review  the  Poisson  distribution. 

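Recall that $X$ has a Poisson distribution with parameter $\lambda$, written
$X \sim \text{Poisson}(\lambda)$, if

$$\mathbb{P}(X = x) \equiv p(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$

Also recall that $\mathbb{E}(X) = \lambda$ and $\mathbb{V}(X) = \lambda$. If $X \sim \text{Poisson}(\lambda)$, $Y \sim \text{Poisson}(\nu)$
and $X$ and $Y$ are independent, then $X + Y \sim \text{Poisson}(\lambda + \nu)$. Finally, if
$N \sim \text{Poisson}(\lambda)$ and $Y \mid N = n \sim \text{Binomial}(n, p)$, then the marginal distribution
of $Y$ is $Y \sim \text{Poisson}(\lambda p)$.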

Now we describe the Poisson process. Imagine that you are at your
computer. Each time a new email message arrives you record the time. Let $X_t$ be
the number of messages you have received up to and including time $t$. Then,
$\{X_t : t \in [0, \infty)\}$ is a stochastic process with state space $\mathcal{X} = \{0, 1, 2, \ldots\}$.
A process of this form is called a counting process. A Poisson process is
a counting process that satisfies certain conditions. In what follows, we will
sometimes write $X(t)$ instead of $X_t$. Also, we need the following notation.
Write $f(h) = o(h)$ if $f(h)/h \to 0$ as $h \to 0$. This means that $f(h)$ is smaller
than $h$ when $h$ is close to 0. For example, $h^2 = o(h)$.

23.32 Definition. A Poisson process is a stochastic process
$\{X_t : t \in [0, \infty)\}$ with state space $\mathcal{X} = \{0, 1, 2, \ldots\}$ such that

1. $X(0) = 0$.

2. For any $0 = t_0 < t_1 < t_2 < \cdots < t_n$, the increments

$$X(t_1) - X(t_0), \; X(t_2) - X(t_1), \; \ldots, \; X(t_n) - X(t_{n-1})$$

are independent.

3. There is a function $\lambda(t)$ such that

$$\mathbb{P}(X(t + h) - X(t) = 1) = \lambda(t) h + o(h) \qquad (23.16)$$

$$\mathbb{P}(X(t + h) - X(t) \ge 2) = o(h). \qquad (23.17)$$

We call $\lambda(t)$ the intensity function.

The last condition means that the probability of an event in $[t, t + h]$ is
approximately $h\lambda(t)$ while the probability of more than one event is small.


23.33 Theorem. If $X_t$ is a Poisson process with intensity function $\lambda(t)$, then

$$X(s + t) - X(s) \sim \text{Poisson}(m(s + t) - m(s))$$

where

$$m(t) = \int_0^t \lambda(s) \, ds.$$

In particular, $X(t) \sim \text{Poisson}(m(t))$. Hence, $\mathbb{E}(X(t)) = m(t)$ and $\mathbb{V}(X(t)) = m(t)$.

23.34 Definition. A Poisson process with intensity function $\lambda(t) \equiv \lambda$ for
some $\lambda > 0$ is called a homogeneous Poisson process with rate $\lambda$. In
this case,

$$X(t) \sim \text{Poisson}(\lambda t).$$

Let $X(t)$ be a homogeneous Poisson process with rate $\lambda$. Let $W_n$ be the
time at which the $n$th event occurs and set $W_0 = 0$. The random variables
$W_0, W_1, \ldots$ are called waiting times. Let $S_n = W_{n+1} - W_n$. Then $S_0, S_1, \ldots$
are called sojourn times or interarrival times.

23.35 Theorem. The sojourn times $S_0, S_1, \ldots$ are iid random variables. Their
distribution is exponential with mean $1/\lambda$, that is, they have density

$$f(s) = \lambda e^{-\lambda s}, \quad s \ge 0.$$

The waiting time $W_n \sim \text{Gamma}(n, 1/\lambda)$, i.e., it has density

$$f(w) = \frac{1}{\Gamma(n)} \lambda^n w^{n-1} e^{-\lambda w}.$$

Hence, $\mathbb{E}(W_n) = n/\lambda$ and $\mathbb{V}(W_n) = n/\lambda^2$.

Proof. First, we have

$$\mathbb{P}(S_1 > t) = \mathbb{P}(X(t) = 0) = e^{-\lambda t}$$

which shows that the CDF for $S_1$ is $1 - e^{-\lambda t}$. This shows the result for $S_1$. Now,

$$\begin{aligned}
\mathbb{P}(S_2 > t \mid S_1 = s) &= \mathbb{P}(\text{no events in } (s, s + t] \mid S_1 = s) \\
&= \mathbb{P}(\text{no events in } (s, s + t]) \quad \text{(increments are independent)} \\
&= e^{-\lambda t}.
\end{aligned}$$

Hence, $S_2$ has an exponential distribution and is independent of $S_1$. The result
follows by repeating the argument. The result for $W_n$ follows since a sum of
exponentials has a Gamma distribution. ■

23.36 Example. Figure 23.3 shows requests to a WWW server in Calgary.¹
Assuming that this is a homogeneous Poisson process, $N \equiv X(T) \sim \text{Poisson}(\lambda T)$.
The likelihood is

$$\mathcal{L}(\lambda) \propto e^{-\lambda T} (\lambda T)^N$$

which is maximized at

$$\hat{\lambda} = \frac{N}{T} = 48.0077$$

in units per minute. Let's now test the assumption that the data follow a
homogeneous Poisson process using a goodness-of-fit test. We divide the interval
$[0, T]$ into 4 equal length intervals $I_1, I_2, I_3, I_4$. If the process is a homogeneous
Poisson process then, given the total number of events, the probability that an
event falls into any of these intervals must be equal. Let $p_i$ be the probability
of a point being in $I_i$. The null hypothesis is that $p_1 = p_2 = p_3 = p_4 = 1/4$.
We can test this hypothesis using either a likelihood ratio test or a $\chi^2$ test.
The latter is

$$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of observations in $I_i$ and $E_i = n/4$ is the expected
number under the null. This yields $\chi^2 = 252$ with a p-value near 0. This is
strong evidence against the null so we reject the hypothesis that the data are
from a homogeneous Poisson process. This is hardly surprising since we would
expect the intensity to vary as a function of time. ■

¹See http://ita.ee.lbl.gov/html/contrib/Calgary-HTTP.html for more information.

FIGURE 23.3. Hits on a web server. Each vertical line represents one event. (Horizontal axis: time, 0 to 1200.)
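The pieces of this section fit together in a short simulation: generate a homogeneous Poisson process from iid exponential sojourn times (Theorem 23.35), estimate the rate by $N/T$, and compute the $\chi^2$ statistic over equal-length intervals. The rate 2.0, the horizon 1000, and the function names are assumptions for illustration; this does not use the Calgary data.

```python
import random

def simulate_poisson_process(rate, T, rng):
    """Event times in [0, T] of a homogeneous Poisson process with the given rate."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)     # iid Exponential sojourn time, mean 1/rate
        if t > T:
            return times
        times.append(t)

def chi_square_equal_bins(times, T, k=4):
    """Counts in k equal-length intervals of [0, T] and the chi-square statistic."""
    counts = [0] * k
    for t in times:
        counts[min(int(k * t / T), k - 1)] += 1
    expected = len(times) / k          # E_i = n / k under the null
    stat = sum((o - expected) ** 2 / expected for o in counts)
    return counts, stat

T = 1000.0
times = simulate_poisson_process(rate=2.0, T=T, rng=random.Random(0))
rate_hat = len(times) / T              # mle of the rate, N / T
counts, stat = chi_square_equal_bins(times, T)
print(rate_hat, counts, stat)
```

Because the simulated process really is homogeneous, the statistic should be an unremarkable draw from a $\chi^2$ distribution with 3 degrees of freedom, in contrast with the value 252 obtained for the web server data.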


23.4  Bibliographic  Remarks 

This is standard material and there are many good references including
Grimmett and Stirzaker (1982), Taylor and Karlin (1994), Guttorp (1995), and
Ross (2002). The following exercises are from those texts.

23.5 Exercises

1. Let $X_0, X_1, \ldots$ be a Markov chain with states {0, 1, 2} and transition
matrix

$$P = \begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.9 & 0.1 & 0.0 \\ 0.1 & 0.8 & 0.1 \end{bmatrix}.$$

Assume that $\mu_0 = (0.3, 0.4, 0.3)$. Find $\mathbb{P}(X_0 = 0, X_1 = 1, X_2 = 2)$ and
$\mathbb{P}(X_0 = 0, X_1 = 1, X_2 = 1)$.

2. Let $Y_1, Y_2, \ldots$ be a sequence of iid observations such that $\mathbb{P}(Y = 0) =
0.1$, $\mathbb{P}(Y = 1) = 0.3$, $\mathbb{P}(Y = 2) = 0.2$, $\mathbb{P}(Y = 3) = 0.4$. Let $X_0 = 0$ and
let

$$X_n = \max\{Y_1, \ldots, Y_n\}.$$

Show that $X_0, X_1, \ldots$ is a Markov chain and find the transition matrix.

3. Consider a two-state Markov chain with states $\mathcal{X} = \{1, 2\}$ and
transition matrix

$$P = \begin{bmatrix} 1 - a & a \\ b & 1 - b \end{bmatrix}$$

where $0 < a < 1$ and $0 < b < 1$. Prove that

$$\lim_{n \to \infty} P^n = \begin{bmatrix} \frac{b}{a+b} & \frac{a}{a+b} \\ \frac{b}{a+b} & \frac{a}{a+b} \end{bmatrix}.$$

4. Consider the chain from question 3 and set $a = .1$ and $b = .3$. Simulate
the chain. Let

$$\hat{p}_n(1) = \frac{1}{n} \sum_{i=1}^{n} I(X_i = 1), \qquad \hat{p}_n(2) = \frac{1}{n} \sum_{i=1}^{n} I(X_i = 2)$$

be the proportion of times the chain is in state 1 and state 2. Plot $\hat{p}_n(1)$
and $\hat{p}_n(2)$ versus $n$ and verify that they converge to the values predicted
from the answer in the previous question.

5. An important Markov chain is the branching process which is used in
biology, genetics, nuclear physics, and many other fields. Suppose that
an animal has $Y$ children. Let $p_k = \mathbb{P}(Y = k)$. Hence, $p_k \ge 0$ for all
$k$ and $\sum_{k=0}^{\infty} p_k = 1$. Assume each animal has the same lifespan and
that they produce offspring according to the distribution $p_k$. Let $X_n$ be
the number of animals in the $n$th generation. Let $Y_1^{(n)}, \ldots, Y_{X_n}^{(n)}$ be the
offspring produced in the $n$th generation. Note that

$$X_{n+1} = Y_1^{(n)} + \cdots + Y_{X_n}^{(n)}.$$

Let $\mu = \mathbb{E}(Y)$ and $\sigma^2 = \mathbb{V}(Y)$. Assume throughout this question that
$X_0 = 1$. Let $M(n) = \mathbb{E}(X_n)$ and $V(n) = \mathbb{V}(X_n)$.

(a) Show that $M(n+1) = \mu M(n)$ and $V(n+1) = \sigma^2 M(n) + \mu^2 V(n)$.

(b) Show that $M(n) = \mu^n$ and that $V(n) = \sigma^2 \mu^{n-1}(1 + \mu + \cdots + \mu^{n-1})$.

(c) What happens to the variance if $\mu > 1$? What happens to the variance
if $\mu = 1$? What happens to the variance if $\mu < 1$?

(d)  The  population  goes  extinct  if  Xn  =  0  for  some  n.  Let  us  thus  define 
the  extinction  time  N  by 


N  =  min {n  :  Xn  =  0}. 


Let $F(n) = \mathbb{P}(N \le n)$ be the CDF of the random variable $N$. Show that

$$F(n) = \sum_{k=0}^{\infty} p_k \left(F(n-1)\right)^k, \qquad n = 1, 2, \ldots$$

Hint: Note that the event $\{N \le n\}$ is the same as event $\{X_n = 0\}$.
Thus, $\mathbb{P}(\{N \le n\}) = \mathbb{P}(\{X_n = 0\})$. Let $k$ be the number of offspring
of the original parent. The population becomes extinct at time $n$ if and
only if each of the $k$ sub-populations generated from the $k$ offspring goes
extinct in $n - 1$ generations.

(e) Suppose that $p_0 = 1/4$, $p_1 = 1/2$, $p_2 = 1/4$. Use the formula from
(5d) to compute the CDF $F(n)$.


6. Let

$$P = \begin{bmatrix} 0.40 & 0.50 & 0.10 \\ 0.05 & 0.70 & 0.25 \\ 0.05 & 0.50 & 0.45 \end{bmatrix}.$$

Find the stationary distribution $\pi$.


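7. Show that if $i$ is a recurrent state and $i \leftrightarrow j$, then $j$ is a recurrent state.

8. Let

$$P = \begin{bmatrix}
\frac13 & 0 & \frac13 & 0 & 0 & \frac13 \\
\frac12 & \frac14 & \frac14 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
\frac14 & \frac14 & \frac14 & 0 & 0 & \frac14 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.$$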

Which  states  are  transient?  Which  states  are  recurrent? 


9. Let

$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

Show that $\pi = (1/2, 1/2)$ is a stationary distribution. Does this chain
converge? Why/why not?


10. Let $0 < p < 1$ and $q = 1 - p$. Let

$$P = \begin{bmatrix}
q & p & 0 & 0 & 0 \\
q & 0 & p & 0 & 0 \\
q & 0 & 0 & p & 0 \\
q & 0 & 0 & 0 & p \\
1 & 0 & 0 & 0 & 0
\end{bmatrix}.$$

Find the limiting distribution of the chain.


11. Let $X(t)$ be an inhomogeneous Poisson process with intensity function
$\lambda(t) > 0$. Let $\Lambda(t) = \int_0^t \lambda(u)\,du$. Define $Y(s) = X(t)$ where $s = \Lambda(t)$.
Show that $Y(s)$ is a homogeneous Poisson process with intensity $\lambda = 1$.

12. Let $X(t)$ be a Poisson process with intensity $\lambda$. Find the conditional
distribution of $X(t)$ given that $X(t + s) = n$.

13. Let $X(t)$ be a Poisson process with intensity $\lambda$. Find the probability
that $X(t)$ is odd, i.e. $\mathbb{P}(X(t) = 1, 3, 5, \ldots)$.

14. Suppose that logins to the University computer system are
described by a Poisson process $X(t)$ with intensity $\lambda$. Assume that a
person stays logged in for some random time with CDF $G$. Assume these
times are all independent. Let $Y(t)$ be the number of people on the
system at time $t$. Find the distribution of $Y(t)$.

15. Let $X(t)$ be a Poisson process with intensity $\lambda$. Let $W_1, W_2, \ldots$ be the
waiting times. Let $f$ be an arbitrary function. Show that




16. A two-dimensional Poisson point process is a process of random points
on the plane such that (i) for any set $A$, the number of points falling
in $A$ is Poisson with mean $\lambda\mu(A)$ where $\mu(A)$ is the area of $A$, (ii) the
number of events in non-overlapping regions is independent. Consider
an arbitrary point $x_0$ in the plane. Let $X$ denote the distance from $x_0$
to the nearest random point. Show that

$$\mathbb{P}(X > t) = e^{-\lambda \pi t^2}$$

and

24 

Simulation  Methods 


In this chapter we will show how simulation can be used to approximate
integrals. Our leading example is the problem of computing integrals in Bayesian
inference but the techniques are widely applicable. We will look at three
integration methods: (i) basic Monte Carlo integration, (ii) importance sampling,
and (iii) Markov chain Monte Carlo (MCMC).


24.1  Bayesian  Inference  Revisited 


Simulation  methods  are  especially  useful  in  Bayesian  inference  so  let  us  briefly 
review  the  main  ideas  in  Bayesian  inference.  See  Chapter  11  for  more  details. 
Given a prior $f(\theta)$ and data $X^n = (X_1, \ldots, X_n)$ the posterior density is

$$f(\theta \mid X^n) = \frac{\mathcal{L}(\theta) f(\theta)}{c}$$

where $\mathcal{L}(\theta)$ is the likelihood function and

$$c = \int \mathcal{L}(\theta) f(\theta)\,d\theta$$

is the normalizing constant. The posterior mean is

$$\bar{\theta} = \int \theta f(\theta \mid X^n)\,d\theta = \frac{\int \theta \mathcal{L}(\theta) f(\theta)\,d\theta}{c}.$$

If $\theta = (\theta_1, \ldots, \theta_k)$ is multidimensional, then we might be interested in the
posterior for one of the components, $\theta_1$, say. This marginal posterior density
is

$$f(\theta_1 \mid X^n) = \int \int \cdots \int f(\theta_1, \ldots, \theta_k \mid X^n)\,d\theta_2 \cdots d\theta_k$$

which involves high-dimensional integration.

When $\theta$ is high-dimensional, it may not be feasible to calculate these
integrals analytically. Simulation methods will often be helpful.


24.2  Basic  Monte  Carlo  Integration 


Suppose we want to evaluate the integral

$$I = \int_a^b h(x)\,dx$$

for some function $h$. If $h$ is an “easy” function like a polynomial or
trigonometric function, then we can do the integral in closed form. If $h$ is
complicated there may be no known closed form expression for $I$. There are many
numerical techniques for evaluating $I$ such as Simpson’s rule, the trapezoidal
rule and Gaussian quadrature. Monte Carlo integration is another approach for
approximating $I$ which is notable for its simplicity, generality and scalability.

Let us begin by writing

$$I = \int_a^b h(x)\,dx = \int_a^b w(x) f(x)\,dx \qquad (24.1)$$

where $w(x) = h(x)(b - a)$ and $f(x) = 1/(b - a)$. Notice that $f$ is the probability
density for a uniform random variable over $(a, b)$. Hence,

$$I = \mathbb{E}_f(w(X))$$

where $X \sim \text{Unif}(a, b)$. If we generate $X_1, \ldots, X_N \sim \text{Unif}(a, b)$, then by the
law of large numbers

$$\hat{I} \equiv \frac{1}{N} \sum_{i=1}^{N} w(X_i) \xrightarrow{P} \mathbb{E}(w(X)) = I. \qquad (24.2)$$

This is the basic Monte Carlo integration method. We can also compute
the standard error of the estimate

$$\widehat{\text{se}} = \frac{s}{\sqrt{N}}$$

where

$$s^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{I})^2}{N - 1}$$

where $Y_i = w(X_i)$. A $1 - \alpha$ confidence interval for $I$ is $\hat{I} \pm z_{\alpha/2}\,\widehat{\text{se}}$. We can take
$N$ as large as we want and hence make the length of the confidence interval
very small.

24.1 Example. Let $h(x) = x^3$. Then, $I = \int_0^1 x^3\,dx = 1/4$. Based on $N =
10{,}000$ observations from a Uniform(0, 1) we get $\hat{I} = .248$ with a standard
error of .0028. ■
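
A minimal sketch of the basic Monte Carlo method for this kind of integral, using only the Python standard library. The function name `mc_integrate` and the seed are choices made for this illustration; a different seed gives a slightly different estimate, so the printed numbers need not match those quoted in the example.

```python
import math
import random

def mc_integrate(h, a, b, N, rng):
    """Basic Monte Carlo estimate of the integral of h over (a, b).

    Returns (estimate, estimated standard error)."""
    # w(X_i) = h(X_i) * (b - a) with X_i ~ Unif(a, b)
    ys = [(b - a) * h(rng.uniform(a, b)) for _ in range(N)]
    est = sum(ys) / N
    s2 = sum((y - est) ** 2 for y in ys) / (N - 1)  # sample variance of the w(X_i)
    return est, math.sqrt(s2 / N)

rng = random.Random(1)
I_hat, se = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10_000, rng)
print(f"estimate = {I_hat:.4f}, se = {se:.4f}")
print(f"approx. 95% interval: ({I_hat - 2 * se:.4f}, {I_hat + 2 * se:.4f})")
```

Because the standard error shrinks like $1/\sqrt{N}$, halving the interval length requires roughly four times as many draws.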

A generalization of the basic method is to consider integrals of the form

$$I = \int h(x) f(x)\,dx \qquad (24.3)$$

where $f(x)$ is a probability density function. Taking $f$ to be a Uniform$(a, b)$
gives us the special case above. Now we draw $X_1, \ldots, X_N \sim f$ and take

$$\hat{I} \equiv \frac{1}{N} \sum_{i=1}^{N} h(X_i)$$

as before.

24.2 Example. Let

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

be the standard Normal PDF. Suppose we want to compute the CDF at some
point $x$:

$$I = \int_{-\infty}^{x} f(s)\,ds \equiv \Phi(x).$$

Write

$$I = \int h(s) f(s)\,ds$$

where

$$h(s) = \begin{cases} 1 & s < x \\ 0 & s \ge x. \end{cases}$$

Now we generate $X_1, \ldots, X_N \sim N(0, 1)$ and set

$$\hat{I} = \frac{1}{N} \sum_{i} h(X_i) = \frac{\text{number of observations} \le x}{N}.$$

For example, with $x = 2$, the true answer is $\Phi(2) = .9772$ and the Monte
Carlo estimate with $N = 10{,}000$ yields .9751. Using $N = 100{,}000$ we get
.9771. ■
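
The counting estimator for the Normal CDF can be sketched as follows. The function name `mc_normal_cdf` and the seed are illustrative assumptions, and the exact value is computed with `math.erf` only for comparison.

```python
import math
import random

def mc_normal_cdf(x, N, rng):
    """Estimate Phi(x) by the fraction of N standard Normal draws that are <= x."""
    hits = sum(1 for _ in range(N) if rng.gauss(0.0, 1.0) <= x)
    return hits / N

rng = random.Random(2)
phi2_hat = mc_normal_cdf(2.0, 100_000, rng)
phi2_exact = 0.5 * (1.0 + math.erf(2.0 / math.sqrt(2.0)))
print(f"Monte Carlo: {phi2_hat:.4f}   exact: {phi2_exact:.4f}")
```

The estimator is a binomial proportion, so its standard error is $\sqrt{\Phi(x)(1 - \Phi(x))/N}$.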
24.3 Example (Bayesian Inference for Two Binomials). Let $X \sim \text{Binomial}(n, p_1)$
and $Y \sim \text{Binomial}(m, p_2)$. We would like to estimate $\delta = p_2 - p_1$. The MLE
is $\hat{\delta} = \hat{p}_2 - \hat{p}_1 = (Y/m) - (X/n)$. We can get the standard error $\widehat{\text{se}}$ using the
delta method which yields

$$\widehat{\text{se}} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n} + \frac{\hat{p}_2(1 - \hat{p}_2)}{m}}$$

and then construct a 95 percent confidence interval $\hat{\delta} \pm 2\,\widehat{\text{se}}$. Now consider a
Bayesian analysis. Suppose we use the prior $f(p_1, p_2) = f(p_1) f(p_2) = 1$, that
is, a flat prior on $(p_1, p_2)$. The posterior is

$$f(p_1, p_2 \mid X, Y) \propto p_1^{X} (1 - p_1)^{n - X}\, p_2^{Y} (1 - p_2)^{m - Y}.$$

The posterior mean of $\delta$ is

$$\bar{\delta} = \int_0^1 \int_0^1 \delta(p_1, p_2) f(p_1, p_2 \mid X, Y)\,dp_1\,dp_2 = \int_0^1 \int_0^1 (p_2 - p_1) f(p_1, p_2 \mid X, Y)\,dp_1\,dp_2.$$

If we want the posterior density of $\delta$ we can first get the posterior CDF

$$F(c \mid X, Y) = \mathbb{P}(\delta \le c \mid X, Y) = \int_A f(p_1, p_2 \mid X, Y)\,dp_1\,dp_2$$

where $A = \{(p_1, p_2) : p_2 - p_1 \le c\}$. The density can then be obtained by
differentiating $F$.

To avoid all these integrals, let’s use simulation. Note that $f(p_1, p_2 \mid X, Y) =
f(p_1 \mid X) f(p_2 \mid Y)$ which implies that $p_1$ and $p_2$ are independent under the
posterior distribution. Also, we see that $p_1 \mid X \sim \text{Beta}(X + 1, n - X + 1)$ and
$p_2 \mid Y \sim \text{Beta}(Y + 1, m - Y + 1)$. Hence, we can simulate
$(P_1^{(1)}, P_2^{(1)}), \ldots, (P_1^{(N)}, P_2^{(N)})$ from the posterior by drawing

$$P_1^{(i)} \sim \text{Beta}(X + 1, n - X + 1)$$
$$P_2^{(i)} \sim \text{Beta}(Y + 1, m - Y + 1)$$

for $i = 1, \ldots, N$. Now let $\delta^{(i)} = P_2^{(i)} - P_1^{(i)}$. Then,

$$\bar{\delta} \approx \frac{1}{N} \sum_{i} \delta^{(i)}.$$

We can also get a 95 percent posterior interval for $\delta$ by sorting the simulated
values, and finding the .025 and .975 quantile. The posterior density $f(\delta \mid X, Y)$
can be obtained by applying density estimation techniques to $\delta^{(1)}, \ldots, \delta^{(N)}$
or, simply by plotting a histogram. For example, suppose that $n = m = 10$,
$X = 8$ and $Y = 6$. From a posterior sample of size 1000 we get a 95 percent
posterior interval of $(-0.52, 0.20)$. The posterior density can be estimated from
a histogram of the simulated values as shown in Figure 24.1. ■

FIGURE 24.1. Posterior of $\delta$ from simulation.
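
The simulation in this example can be sketched with `random.betavariate` from the Python standard library. The function name, seed, and sample size are choices made for this illustration, so the interval printed will be close to, but not identical with, the one quoted above.

```python
import random

def posterior_delta_draws(X, n, Y, m, N, rng):
    """Draw N values of delta = p2 - p1 from the posterior under a flat prior."""
    draws = []
    for _ in range(N):
        p1 = rng.betavariate(X + 1, n - X + 1)  # p1 | X ~ Beta(X+1, n-X+1)
        p2 = rng.betavariate(Y + 1, m - Y + 1)  # p2 | Y ~ Beta(Y+1, m-Y+1)
        draws.append(p2 - p1)
    return draws

rng = random.Random(3)
delta = sorted(posterior_delta_draws(X=8, n=10, Y=6, m=10, N=20_000, rng=rng))
post_mean = sum(delta) / len(delta)
lo = delta[int(0.025 * len(delta))]  # empirical .025 quantile
hi = delta[int(0.975 * len(delta))]  # empirical .975 quantile
print(f"posterior mean = {post_mean:.3f}, 95% interval = ({lo:.2f}, {hi:.2f})")
```

Any other functional of $(p_1, p_2)$, such as the odds ratio, can be handled the same way by transforming each pair of draws.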


24.4 Example (Bayesian Inference for Dose Response). Suppose we conduct an
experiment by giving rats one of ten possible doses of a drug, denoted by
$x_1 < x_2 < \cdots < x_{10}$. For each dose level $x_i$ we use $n$ rats and we observe
$Y_i$, the number that survive. Thus we have ten independent binomials $Y_i \sim
\text{Binomial}(n, p_i)$. Suppose we know from biological considerations that higher
doses should have higher probability of death. Thus, $p_1 \le p_2 \le \cdots \le p_{10}$. We
want to estimate the dose at which the animals have a 50 percent chance of
dying. This is called the LD50. Formally, $\delta = x_j$ where

$$j = \min\{i : p_i \ge .50\}.$$

Notice that $\delta$ is implicitly a (complicated) function of $p_1, \ldots, p_{10}$ so we can
write $\delta = g(p_1, \ldots, p_{10})$ for some $g$. This just means that if we know $(p_1, \ldots, p_{10})$
then we can find $\delta$. The posterior mean of $\delta$ is

$$\int \int \cdots \int_A g(p_1, \ldots, p_{10}) f(p_1, \ldots, p_{10} \mid Y_1, \ldots, Y_{10})\,dp_1\,dp_2 \cdots dp_{10}.$$

The integral is over the region

$$A = \{(p_1, \ldots, p_{10}) : p_1 \le \cdots \le p_{10}\}.$$

The posterior CDF of $\delta$ is

$$F(c \mid Y_1, \ldots, Y_{10}) = \mathbb{P}(\delta \le c \mid Y_1, \ldots, Y_{10}) = \int \int \cdots \int_B f(p_1, \ldots, p_{10} \mid Y_1, \ldots, Y_{10})\,dp_1\,dp_2 \cdots dp_{10}$$

where

$$B = A \cap \{(p_1, \ldots, p_{10}) : g(p_1, \ldots, p_{10}) \le c\}.$$

We need to do a 10-dimensional integral over a restricted region $A$. Instead,
we will use simulation. Let us take a flat prior truncated over $A$. Except for
the truncation, each $P_i$ has once again a Beta distribution. To draw from the
posterior we do the following steps:

(1) Draw $P_i \sim \text{Beta}(Y_i + 1, n - Y_i + 1)$, $i = 1, \ldots, 10$.

(2) If $P_1 \le P_2 \le \cdots \le P_{10}$ keep this draw. Otherwise, throw it away and
draw again until you get one you can keep.

(3) Let $\delta = x_j$ where

$$j = \min\{i : P_i \ge .50\}.$$

We repeat this $N$ times to get $\delta^{(1)}, \ldots, \delta^{(N)}$ and take

$$\mathbb{E}(\delta \mid Y_1, \ldots, Y_{10}) \approx \frac{1}{N} \sum_{i} \delta^{(i)}.$$

$\delta$ is a discrete variable. We can estimate its probability mass function by

$$\mathbb{P}(\delta = x_j \mid Y_1, \ldots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^{N} I(\delta^{(i)} = x_j).$$
For  example,  consider  the  following  data: 


Dose                        1   2   3   4   5   6   7   8   9  10
Number of animals n_i      15  15  15  15  15  15  15  15  15  15
Number of survivors Y_i     0   0   2   2   8  10  12  14  15  14

The posterior draws for $p_1, \ldots, p_{10}$ are shown in the second panel in the
figure. We find that $\bar{\delta} = 4.04$ with a 95 percent interval of (3, 5). ■
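
Steps (1) to (3) can be sketched as a rejection sampler. This is an illustration under stated assumptions: the counts are the ones tabulated above, the function names and seed are made up, a draw in which no $P_i$ reaches .50 is discarded (a convention adopted here, not taken from the text), and the numbers it prints are not claimed to reproduce those quoted in the example.

```python
import random

doses = list(range(1, 11))
n = 15
Y = [0, 0, 2, 2, 8, 10, 12, 14, 15, 14]

def draw_ordered(rng):
    """Steps (1)-(2): draw independent Betas, keep only a monotone draw."""
    while True:
        p = [rng.betavariate(y + 1, n - y + 1) for y in Y]
        if all(p[i] <= p[i + 1] for i in range(len(p) - 1)):
            return p

def ld50(p):
    """Step (3): the first dose whose probability is at least .50."""
    for x, pi in zip(doses, p):
        if pi >= 0.5:
            return x
    return None  # no dose reaches .50 in this draw

rng = random.Random(4)
deltas = []
while len(deltas) < 300:
    d = ld50(draw_ordered(rng))
    if d is not None:  # assumption: discard draws with no crossing
        deltas.append(d)

post_mean = sum(deltas) / len(deltas)
pmf = {x: deltas.count(x) / len(deltas) for x in doses}
print("posterior mean of delta:", round(post_mean, 2))
print("estimated pmf:", {x: round(q, 3) for x, q in pmf.items() if q > 0})
```

Most proposals violate the ordering and are rejected, which is why the sampler needs many more Beta draws than the 300 values it keeps.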


24.3  Importance  Sampling 

Consider again the integral $I = \int h(x) f(x)\,dx$ where $f$ is a probability density.
The basic Monte Carlo method involves sampling from $f$. However, there
are cases where we may not know how to sample from $f$. For example, in
Bayesian inference, the posterior density is obtained by multiplying
the likelihood $\mathcal{L}(\theta)$ times the prior $f(\theta)$. There is no guarantee that $f(\theta \mid x)$
will be a known distribution like a Normal or Gamma or whatever.

Importance sampling is a generalization of basic Monte Carlo which
overcomes this problem. Let $g$ be a probability density that we know how to
simulate from. Then

$$I = \int h(x) f(x)\,dx = \int \frac{h(x) f(x)}{g(x)}\, g(x)\,dx = \mathbb{E}_g(Y) \qquad (24.4)$$

where $Y = h(X) f(X)/g(X)$ and the expectation $\mathbb{E}_g(Y)$ is with respect to $g$.
We can simulate $X_1, \ldots, X_N \sim g$ and estimate $I$ by

$$\hat{I} = \frac{1}{N} \sum_{i} Y_i = \frac{1}{N} \sum_{i} \frac{h(X_i) f(X_i)}{g(X_i)}. \qquad (24.5)$$

This is called importance sampling. By the law of large numbers, $\hat{I} \xrightarrow{P} I$.
However, there is a catch. It’s possible that $\hat{I}$ might have an infinite standard
error. To see why, recall that $I$ is the mean of $w(x) = h(x) f(x)/g(x)$. The
second moment of this quantity is

$$\mathbb{E}_g(w^2(X)) = \int \left(\frac{h(x) f(x)}{g(x)}\right)^2 g(x)\,dx = \int \frac{h^2(x) f^2(x)}{g(x)}\,dx. \qquad (24.6)$$
If $g$ has thinner tails than $f$, then this integral might be infinite. To avoid this,
a basic rule in importance sampling is to sample from a density $g$ with thicker
tails than $f$. Also, suppose that $g(x)$ is small over some set $A$ where $f(x)$ is
large. Again, the ratio of $f/g$ could be large leading to a large variance. This
implies that we should choose $g$ to be similar in shape to $f$. In summary, a
good choice for an importance sampling density $g$ should be similar to $f$ but
with thicker tails. In fact, we can say what the optimal choice of $g$ is.

24.5 Theorem. The choice of $g$ that minimizes the variance of $\hat{I}$ is

$$g^*(x) = \frac{|h(x)| f(x)}{\int |h(s)| f(s)\,ds}.$$

Proof. The variance of $w = fh/g$ is

$$\begin{aligned}
\mathbb{E}_g(w^2) - (\mathbb{E}_g(w))^2
&= \int w^2(x) g(x)\,dx - \left(\int w(x) g(x)\,dx\right)^2 \\
&= \int \frac{h^2(x) f^2(x)}{g^2(x)}\, g(x)\,dx - \left(\int \frac{h(x) f(x)}{g(x)}\, g(x)\,dx\right)^2 \\
&= \int \frac{h^2(x) f^2(x)}{g^2(x)}\, g(x)\,dx - \left(\int h(x) f(x)\,dx\right)^2.
\end{aligned}$$

The second integral does not depend on $g$, so we only need to minimize the
first integral. From Jensen’s inequality (Theorem 4.9) we have

$$\mathbb{E}_g(W^2) \ge (\mathbb{E}_g(|W|))^2 = \left(\int |h(x)| f(x)\,dx\right)^2.$$

This establishes a lower bound on $\mathbb{E}_g(W^2)$. However, $\mathbb{E}_{g^*}(W^2)$ equals this
lower bound which proves the claim. ■

This theorem is interesting but it is only of theoretical interest. If we did
not know how to sample from $f$ then it is unlikely that we could sample from
$|h(x)| f(x) / \int |h(s)| f(s)\,ds$. In practice, we simply try to find a thick-tailed
distribution $g$ which is similar to $f|h|$.

24.6 Example (Tail Probability). Let’s estimate $I = \mathbb{P}(Z > 3) = .0013$ where
$Z \sim N(0, 1)$. Write $I = \int h(x) f(x)\,dx$ where $f(x)$ is the standard Normal
density and $h(x) = 1$ if $x > 3$, and 0 otherwise. The basic Monte Carlo
estimator is $\hat{I} = N^{-1} \sum_i h(X_i)$ where $X_1, \ldots, X_N \sim N(0, 1)$. Using $N = 100$
we find (from simulating many times) that $\mathbb{E}(\hat{I}) = .0015$ and that the standard
deviation of $\hat{I}$ is .0039. Notice that most observations are wasted in the sense
that most are not near the right tail. Now we will estimate this with importance
sampling taking $g$ to be a Normal(4, 1) density. We draw values from $g$ and the
estimate is now $\hat{I} = N^{-1} \sum_i f(X_i) h(X_i)/g(X_i)$. In this case we find that
$\mathbb{E}(\hat{I}) = .0011$ and that the standard deviation of $\hat{I}$ is .0002. We have reduced
the standard deviation by a factor of 20. ■
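
A sketch of this comparison, assuming the same setup (target $N(0,1)$, threshold 3, importance density $N(4,1)$, $N = 100$ draws per estimate). The function names, seed, and number of replications are illustrative choices, and the simulated spreads will only roughly match the figures quoted above.

```python
import math
import random
import statistics

def norm_pdf(x, mu=0.0):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def basic_mc(N, rng):
    """Fraction of N(0, 1) draws exceeding 3."""
    return sum(1 for _ in range(N) if rng.gauss(0.0, 1.0) > 3) / N

def importance(N, rng, shift=4.0):
    """Importance sampling with g = N(shift, 1): average h(x) f(x) / g(x)."""
    total = 0.0
    for _ in range(N):
        x = rng.gauss(shift, 1.0)
        if x > 3:  # h(x) = 1 only on the tail
            total += norm_pdf(x) / norm_pdf(x, shift)
    return total / N

rng = random.Random(5)
reps = 2000
basic = [basic_mc(100, rng) for _ in range(reps)]
imp = [importance(100, rng) for _ in range(reps)]
print("basic MC   : mean %.5f  sd %.5f" % (statistics.mean(basic), statistics.stdev(basic)))
print("importance : mean %.5f  sd %.5f" % (statistics.mean(imp), statistics.stdev(imp)))
```

Shifting $g$ toward the tail puts most draws where $h$ is nonzero, which is where the reduction in spread comes from.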

24.7 Example (Measurement Model With Outliers). Suppose we have
measurements $X_1, \ldots, X_n$ of some physical quantity $\theta$. A reasonable model is

$$X_i = \theta + \epsilon_i.$$

If we assume that $\epsilon_i \sim N(0, 1)$ then $X_i \sim N(\theta, 1)$. However, when taking
measurements, it is often the case that we get the occasional wild observation,
or outlier. This suggests that a Normal might be a poor model since Normals
have thin tails which implies that extreme observations are rare. One way to
improve the model is to use a density for $\epsilon_i$ with a thicker tail, for example,
a t-distribution with $\nu$ degrees of freedom which has the form

$$t(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu \pi}} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}.$$

Smaller values of $\nu$ correspond to thicker tails. For the sake of illustration we
will take $\nu = 3$. Suppose we observe $n$ values $X_i = \theta + \epsilon_i$, $i = 1, \ldots, n$ where $\epsilon_i$ has
a t distribution with $\nu = 3$. We will take a flat prior on $\theta$. The likelihood is
$\mathcal{L}(\theta) = \prod_{i=1}^{n} t(X_i - \theta)$ and the posterior mean of $\theta$ is

$$\bar{\theta} = \frac{\int \theta \mathcal{L}(\theta)\,d\theta}{\int \mathcal{L}(\theta)\,d\theta}.$$

We can estimate the top and bottom integral using importance sampling. We
draw $\theta_1, \ldots, \theta_N \sim g$ and then

$$\bar{\theta} \approx \frac{\frac{1}{N} \sum_{j=1}^{N} \frac{\theta_j \mathcal{L}(\theta_j)}{g(\theta_j)}}{\frac{1}{N} \sum_{j=1}^{N} \frac{\mathcal{L}(\theta_j)}{g(\theta_j)}}.$$

To illustrate the idea, we drew $n = 2$ observations. The posterior mean
(computed numerically) is -0.54. Using a Normal importance sampler $g$ yields an
estimate of -0.74. Using a Cauchy (t-distribution with 1 degree of freedom)
importance sampler yields an estimate of -0.53. ■
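
The ratio estimator with a Cauchy importance sampler can be sketched as below. The two data values are hypothetical (the observations used in the example are not given), the Cauchy is centered at the sample mean as a convenience, and all function names are made up for this sketch. With two observations and a symmetric error density the posterior is symmetric about their midpoint, which gives an exact value to check against.

```python
import math
import random

def t3_pdf(x):
    """Student t density with nu = 3 degrees of freedom."""
    nu = 3.0
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def cauchy_pdf(x, loc):
    """Cauchy density with location loc and scale 1."""
    return 1.0 / (math.pi * (1 + (x - loc) ** 2))

def posterior_mean_is(data, N, rng):
    """Self-normalized importance sampling estimate of the posterior mean."""
    loc = sum(data) / len(data)
    num = den = 0.0
    for _ in range(N):
        # Inverse-CDF draw from the Cauchy importance density g
        theta = loc + math.tan(math.pi * (rng.random() - 0.5))
        like = math.prod(t3_pdf(x - theta) for x in data)  # L(theta)
        w = like / cauchy_pdf(theta, loc)                  # L(theta) / g(theta)
        num += theta * w
        den += w
    return num / den

rng = random.Random(6)
data = [-1.0, 0.2]  # hypothetical observations
est = posterior_mean_is(data, 50_000, rng)
print(f"importance sampling estimate: {est:.3f}  (midpoint of data: {sum(data) / 2:.3f})")
```

The Cauchy has thicker tails than the likelihood here, so the weights $\mathcal{L}(\theta)/g(\theta)$ stay bounded, in line with the rule of thumb stated earlier in this section.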


24.4  MCMC  Part  I:  The  Metropolis-Hastings 
Algorithm 

Consider once more the problem of estimating the integral $I = \int h(x) f(x)\,dx$.
Now we introduce Markov chain Monte Carlo (MCMC) methods. The idea is
to construct a Markov chain $X_1, X_2, \ldots$ whose stationary distribution is $f$.
Under certain conditions it will then follow that

$$\frac{1}{N} \sum_{i=1}^{N} h(X_i) \xrightarrow{P} \mathbb{E}_f(h(X)) = I.$$

This works because there is a law of large numbers for Markov chains; see
Theorem 23.25.

The Metropolis-Hastings algorithm is a specific MCMC method that
works as follows. Let $q(y \mid x)$ be an arbitrary, friendly distribution (i.e., we
know how to sample from $q(y \mid x)$). The conditional density $q(y \mid x)$ is called
the proposal distribution. The Metropolis-Hastings algorithm creates a
sequence of observations $X_0, X_1, \ldots$ as follows.

Metropolis-Hastings Algorithm

Choose $X_0$ arbitrarily. Suppose we have generated $X_0, X_1, \ldots, X_i$. To
generate $X_{i+1}$ do the following:

(1) Generate a proposal or candidate value $Y \sim q(y \mid X_i)$.

(2) Evaluate $r \equiv r(X_i, Y)$ where

$$r(x, y) = \min\left\{\frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)},\ 1\right\}.$$

(3) Set

$$X_{i+1} = \begin{cases} Y & \text{with probability } r \\ X_i & \text{with probability } 1 - r. \end{cases}$$

24.8 Remark. A simple way to execute step (3) is to generate $U \sim \text{Unif}(0, 1)$. If
$U < r$ set $X_{i+1} = Y$ otherwise set $X_{i+1} = X_i$.


24.9 Remark. A common choice for $q(y \mid x)$ is $N(x, b^2)$ for some $b > 0$. This
means that the proposal is drawn from a Normal, centered at the current
value. In this case, the proposal density $q$ is symmetric, $q(y \mid x) = q(x \mid y)$, and
$r$ simplifies to

$$r = \min\left\{\frac{f(y)}{f(x)},\ 1\right\}.$$

By construction, $X_0, X_1, \ldots$ is a Markov chain. But why does this Markov
chain have $f$ as its stationary distribution? Before we explain why, let us first
do an example.


24.10 Example. The Cauchy distribution has density

$$f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}.$$

Our goal is to simulate a Markov chain whose stationary distribution is $f$.
As suggested in the remark above, we take $q(y \mid x)$ to be a $N(x, b^2)$. So in this
case,

$$r(x, y) = \min\left\{\frac{f(y)}{f(x)},\ 1\right\} = \min\left\{\frac{1 + x^2}{1 + y^2},\ 1\right\}.$$

So the algorithm is to draw $Y \sim N(X_i, b^2)$ and set

$$X_{i+1} = \begin{cases} Y & \text{with probability } r(X_i, Y) \\ X_i & \text{with probability } 1 - r(X_i, Y). \end{cases}$$

The simulator requires a choice of $b$. Figure 24.2 shows three chains of length
$N = 1{,}000$ using $b = .1$, $b = 1$ and $b = 10$. Setting $b = .1$ forces the chain
to take small steps. As a result, the chain doesn’t “explore” much of the
sample space. The histogram from the sample does not approximate the true
density very well. Setting $b = 10$ causes the proposals to often be far in the
tails, making $r$ small and hence we reject the proposal and keep the chain
at its current position. The result is that the chain “gets stuck” at the same
place quite often. Again, this means that the histogram from the sample does
not approximate the true density very well. The middle choice avoids these
extremes and results in a Markov chain sample that better represents the
density sooner. In summary, there are tuning parameters and the efficiency
of the chain depends on these parameters. We’ll discuss this in more detail
later. ■

FIGURE 24.2. Three Metropolis chains corresponding to $b = .1$, $b = 1$, $b = 10$.
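
The chain in this example can be sketched as follows. The function name, starting value, seed, and chain length are assumptions made for this illustration; the acceptance rates it prints give a quick numerical view of the effect of $b$ described above.

```python
import random

def metropolis_cauchy(b, N, rng, x0=0.0):
    """Random-walk Metropolis chain targeting the standard Cauchy density.

    Proposal: Y ~ N(x, b^2). Returns (chain, acceptance rate)."""
    chain = [x0]
    accepted = 0
    for _ in range(N):
        x = chain[-1]
        y = rng.gauss(x, b)
        # f(y) / f(x) = (1 + x^2) / (1 + y^2) for the Cauchy density
        r = min((1 + x * x) / (1 + y * y), 1.0)
        if rng.random() < r:
            chain.append(y)
            accepted += 1
        else:
            chain.append(x)
    return chain, accepted / N

rng = random.Random(7)
rates = {}
for b in (0.1, 1.0, 10.0):
    chain, rates[b] = metropolis_cauchy(b, 5000, rng)
    print(f"b = {b:5.1f}   acceptance rate = {rates[b]:.2f}")
```

A very high acceptance rate goes with tiny steps and a very low one with a chain that rarely moves, so neither extreme is a sign of good mixing.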

If the sample from the Markov chain starts to “look like” the target
distribution $f$ quickly, then we say that the chain is “mixing well.” Constructing a
chain that mixes well is somewhat of an art.

Why It Works. Recall from Chapter 23 that a distribution $\pi$ satisfies
detailed balance for a Markov chain if

$$p_{ij} \pi_i = p_{ji} \pi_j.$$

We showed that if $\pi$ satisfies detailed balance, then it is a stationary
distribution for the chain.

Because we are now dealing with continuous state Markov chains, we will
change notation a little and write $p(x, y)$ for the probability of making a
transition from $x$ to $y$. Also, let’s use $f(x)$ instead of $\pi$ for a distribution. In
this new notation, $f$ is a stationary distribution if $f(x) = \int f(y) p(y, x)\,dy$ and
detailed balance holds for $f$ if

$$f(x) p(x, y) = f(y) p(y, x). \qquad (24.7)$$

Detailed balance implies that $f$ is a stationary distribution since, if detailed
balance holds, then

$$\int f(y) p(y, x)\,dy = \int f(x) p(x, y)\,dy = f(x) \int p(x, y)\,dy = f(x)$$

which shows that $f(x) = \int f(y) p(y, x)\,dy$ as required. Our goal is to show that
$f$ satisfies detailed balance which will imply that $f$ is a stationary distribution
for the chain.

Consider two points $x$ and $y$. Either

$$f(x) q(y \mid x) < f(y) q(x \mid y) \quad \text{or} \quad f(x) q(y \mid x) > f(y) q(x \mid y).$$

We will ignore ties (which occur with probability zero for continuous
distributions). Without loss of generality, assume that $f(x) q(y \mid x) > f(y) q(x \mid y)$. This
implies that

$$r(x, y) = \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)}$$

and that $r(y, x) = 1$. Now $p(x, y)$ is the probability of jumping from $x$ to $y$.
This requires two things: (i) the proposal distribution must generate $y$, and
(ii) you must accept $y$. Thus,

$$p(x, y) = q(y \mid x)\, r(x, y) = q(y \mid x) \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)} = \frac{f(y)}{f(x)}\, q(x \mid y).$$

Therefore,

$$f(x) p(x, y) = f(y) q(x \mid y). \qquad (24.8)$$

On the other hand, $p(y, x)$ is the probability of jumping from $y$ to $x$. This
requires two things: (i) the proposal distribution must generate $x$, and (ii) you
must accept $x$. This occurs with probability $p(y, x) = q(x \mid y)\, r(y, x) = q(x \mid y)$.
Hence,

$$f(y) p(y, x) = f(y) q(x \mid y). \qquad (24.9)$$

Comparing (24.8) and (24.9), we see that we have shown that detailed balance
holds.



24.5  MCMC  Part  II:  Different  Flavors 


There  are  different  types  of  MCMC  algorithm.  Here  we  will  consider  a  few  of 
the  most  popular  versions. 

Random-Walk-Metropolis-Hastings. In the previous section we
considered drawing a proposal $Y$ of the form

$$Y = X_i + \epsilon_i$$

where $\epsilon_i$ comes from some distribution with density $g$. In other words, $q(y \mid x) =
g(y - x)$. We saw that in this case (with $g$ symmetric about zero, as the Normal is),

$$r(x, y) = \min\left\{1,\ \frac{f(y)}{f(x)}\right\}.$$

This is called a random-walk-Metropolis-Hastings method. The reason
for the name is that, if we did not do the accept-reject step, we would be
simulating a random walk. The most common choice for $g$ is a $N(0, b^2)$. The
hard part is choosing $b$ so that the chain mixes well. A good rule of thumb is:
choose $b$ so that you accept the proposals about 50 percent of the time.

Warning! This method doesn’t make sense unless $X$ takes values on the
whole real line. If $X$ is restricted to some interval then it is best to transform
$X$. For example, if $X \in (0, \infty)$ then you might take $Y = \log X$ and then
simulate the distribution for $Y$ instead of $X$.

Independence-Metropolis-Hastings. This is an importance-sampling
version of MCMC. We draw the proposal from a fixed distribution $g$.
Generally, $g$ is chosen to be an approximation to $f$. The acceptance probability
becomes

$$r(x, y) = \min\left\{1,\ \frac{f(y)}{f(x)} \frac{g(x)}{g(y)}\right\}.$$


Gibbs Sampling. The two previous methods can be easily adapted, in
principle, to work in higher dimensions. In practice, tuning the chains to make
them mix well is hard. Gibbs sampling is a way to turn a high-dimensional
problem into several one-dimensional problems.

Here’s how it works for a bivariate problem. Suppose that $(X, Y)$ has
density $f_{X,Y}(x, y)$. First, suppose that it is possible to simulate from the
conditional distributions $f_{X \mid Y}(x \mid y)$ and $f_{Y \mid X}(y \mid x)$. Let $(X_0, Y_0)$ be starting values.
Assume we have drawn $(X_0, Y_0), \ldots, (X_n, Y_n)$. Then the Gibbs sampling
algorithm for getting $(X_{n+1}, Y_{n+1})$ is:

Gibbs Sampling

$$X_{n+1} \sim f_{X \mid Y}(x \mid Y_n)$$
$$Y_{n+1} \sim f_{Y \mid X}(y \mid X_{n+1})$$

repeat

This generalizes in the obvious way to higher dimensions.
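
As a small illustration of the two-step update, here is a sketch for a target chosen for convenience, not taken from the text: a bivariate Normal with zero means, unit variances, and correlation $\rho$. For that target both conditionals are Normal, $X \mid Y = y \sim N(\rho y, 1 - \rho^2)$ and symmetrically for $Y \mid X$. The function name and seed are assumptions.

```python
import math
import random

def gibbs_bivariate_normal(rho, N, rng, x0=0.0, y0=0.0):
    """Gibbs sampler for a standard bivariate Normal with correlation rho."""
    s = math.sqrt(1 - rho ** 2)  # conditional standard deviation
    x, y = x0, y0
    draws = []
    for _ in range(N):
        x = rng.gauss(rho * y, s)  # X_{n+1} ~ f(x | Y_n)
        y = rng.gauss(rho * x, s)  # Y_{n+1} ~ f(y | X_{n+1}), uses the new x
        draws.append((x, y))
    return draws

def correlation(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(8)
draws = gibbs_bivariate_normal(rho=0.8, N=20_000, rng=rng)
print("sample correlation:", round(correlation(draws), 3))
```

The stronger the correlation, the smaller each conditional step is relative to the spread of the target, so the chain mixes more slowly as $\rho$ approaches 1.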

24.11  Example  (Normal  Hierarchical  Model).  Gibbs  sampling  is  very  useful 
for  a  class  of  models  called  hierarchical  models.  Here  is  a  simple  case. 
Suppose  we  draw  a  sample  of  k  cities.  From  each  city  we  draw  people  and 
observe  how  many  people  Yi  have  a  disease.  Thus,  Yi  ~  Binomial(n^,p^).  We 
are  allowing  for  different  disease  rates  in  different  cities.  We  can  also  think  of 
the  p[s  as  random  draws  from  some  distribution  F.  We  can  write  this  model 
in  the  following  way: 

Pi  ~  F 

Yi\Pi  =  Pi  ~  Binomial^,  ft). 

We  are  interested  in  estimating  the  p's  and  the  overall  disease  rate  f  pdF(p). 

To  proceed,  it  will  simplify  matters  if  we  make  some  transformations  that 
allow  us  to  use  some  Normal  approximations.  Let  pi  =  Yi/rii.  Recall  that 
Pi  «  N(pi,Si)  where  s*  =  v/pi(  1  -  p* )/rij.  Let  ipi  =  log(p,/(l  -  p*))  and 
define  Zi  =  ipi  =  \og(pi/(l  ~Pi))-  By  the  delta  method, 

where  af  =  l/(npi(l  — Pi )).  Experience  shows  that  the  Normal  approximation 
for  p)  is  more  accurate  than  the  Normal  approximation  for  p  so  we  shall  work 
with  ip.  We  shall  treat  as  known.  Furthermore,  we  shall  take  the  distribution 
of  the  p)\s  to  be  Normal.  The  hierarchical  model  is  now 

4>i  ~  N(p,T2) 

Zi\ipi  ~  N (ipi,  o\). 

As  yet  another  simplification  we  take  r  =  1.  The  unknown  parameter  are 
6  =  (/i,  -0i, . . . ,  t/jk)-  The  likelihood  function  is 

m  oc 


nexp{-|(^->-)2}exp 
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If  we  use  the  prior  /(/x)  oc  1  then  the  posterior  is  proportional  to  the  likelihood. 
To  use  Gibbs  sampling,  we  need  to  find  the  conditional  distribution  of  each 
parameter  conditional  on  all  the  others.  Let  us  begin  by  finding  /(/x|rest) 
where  “rest”  refers  to  all  the  other  variables.  We  can  throw  away  any  terms 
that  don’t  involve  /x.  Thus, 


/(/X  |  rest) 


(X 


OC 


eXP{-^(M-^)2} 


where 

b  = 

i 


Hence  we  see  that  /x|rest  ^  iV(6, 1/fc).  Next  we  will  find  /(direst).  Again,  we 
can  throw  away  any  terms  not  involving  ^  leaving  us  with 


/(direst) 


OC 


oc 


exp 


where 


e*  = 


% 


1  + 


and  £ 


and  so  ^|rest  ^  iV(e^,df).  The  Gibbs  sampling  algorithm  then  involves  iter¬ 
ating  the  following  steps  N  times: 


draw  /x  N(b,  v2) 

draw  ipi  iV(ei,  d\) 


draw  tpk  ~  N(ek,d2k). 

It  is  understood  that  at  each  step,  the  most  recently  drawn  version  of  each 
variable  is  used. 

We generated a numerical example with $k = 20$ cities and $n = 20$ people
from each city. After running the chain, we can convert each $\psi_i$ back into $p_i$
by way of $p_i = e^{\psi_i}/(1 + e^{\psi_i})$. The raw proportions are shown in Figure 24.4.
Figure 24.3 shows "trace plots" of the Markov chain for $p_1$ and $\mu$. Figure
24.4 shows the posterior for $\mu$ based on the simulated values. The second
panel of Figure 24.4 shows the raw proportions and the Bayes estimates. Note
that the Bayes estimates are "shrunk" together. The parameter $\tau$ controls
the amount of shrinkage. We set $\tau = 1$ but, in practice, we should treat $\tau$ as
another unknown parameter and let the data determine how much shrinkage
is needed. ■
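
The sampler in this example can be sketched in Python as follows. This is a minimal illustration assuming NumPy, not the code used to produce the figures. The simulated data (true proportions drawn uniformly on (0.2, 0.8), counts clipped away from 0 and $n$ so that the logits stay finite), the burn-in length, and every function and variable name are assumptions made for the sketch.

```python
import numpy as np

def gibbs_hierarchical(z, sigma2, n_iter=5000, seed=0):
    """Gibbs sampler for psi_i ~ N(mu, 1), Z_i | psi_i ~ N(psi_i, sigma_i^2),
    with a flat prior on mu (tau fixed at 1, as in the example)."""
    rng = np.random.default_rng(seed)
    k = len(z)
    d2 = 1.0 / (1.0 + 1.0 / sigma2)       # conditional variances d_i^2
    psi = z.copy()                        # start the chain at the observed logits
    mu_draws = np.empty(n_iter)
    psi_draws = np.empty((n_iter, k))
    for t in range(n_iter):
        # mu | rest ~ N(b, 1/k), where b is the mean of the current psi_i
        mu = rng.normal(psi.mean(), np.sqrt(1.0 / k))
        # psi_i | rest ~ N(e_i, d_i^2), using the mu just drawn; the psi_i are
        # conditionally independent given mu, so they can be drawn together
        e = d2 * (z / sigma2 + mu)
        psi = rng.normal(e, np.sqrt(d2))
        mu_draws[t] = mu
        psi_draws[t] = psi
    return mu_draws, psi_draws

# Hypothetical data: k = 20 cities, n = 20 people per city.
rng = np.random.default_rng(1)
k, n = 20, 20
true_p = rng.uniform(0.2, 0.8, size=k)
y = np.clip(rng.binomial(n, true_p), 1, n - 1)   # assumption: avoid 0 and n
p_hat = y / n                                    # raw proportions
z = np.log(p_hat / (1 - p_hat))                  # observed logits Z_i
sigma2 = 1.0 / (n * p_hat * (1 - p_hat))         # treated as known

mu_draws, psi_draws = gibbs_hierarchical(z, sigma2)
burn = 500                                       # assumed burn-in
# Convert each psi_i back to p_i and average over the chain.
p_bayes = (1.0 / (1.0 + np.exp(-psi_draws[burn:]))).mean(axis=0)
```

Comparing `p_hat` with `p_bayes` shows the shrinkage described above: the Bayes estimates are pulled toward a common value, so they are less spread out than the raw proportions.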
FIGURE 24.3. Posterior simulation for Example 24.11. The top panel shows
simulated values of $p_1$. The bottom panel shows simulated values of $\mu$.


So far we assumed that we know how to draw samples from the conditionals
$f_{X|Y}(x \mid y)$ and $f_{Y|X}(y \mid x)$. If we don't know how, we can still use the Gibbs
sampling algorithm by drawing each observation using a Metropolis-Hastings
step. Let $q$ be a proposal distribution for $x$ and let $\tilde{q}$ be a proposal distribution
for $y$. When we do a Metropolis step for $X$, we treat $Y$ as fixed. Similarly,
when we do a Metropolis step for $Y$, we treat $X$ as fixed. Here are the steps:

Metropolis within Gibbs

(1a) Draw a proposal $Z \sim q(z \mid X_n)$.

(1b) Evaluate

$$r = \min\left\{ \frac{f(Z, Y_n)}{f(X_n, Y_n)} \, \frac{q(X_n \mid Z)}{q(Z \mid X_n)},\ 1 \right\}.$$

(1c) Set

$$X_{n+1} = \begin{cases} Z & \text{with probability } r \\ X_n & \text{with probability } 1 - r. \end{cases}$$

(2a) Draw a proposal $Z \sim \tilde{q}(z \mid Y_n)$.

(2b) Evaluate

$$r = \min\left\{ \frac{f(X_{n+1}, Z)}{f(X_{n+1}, Y_n)} \, \frac{\tilde{q}(Y_n \mid Z)}{\tilde{q}(Z \mid Y_n)},\ 1 \right\}.$$

(2c) Set

$$Y_{n+1} = \begin{cases} Z & \text{with probability } r \\ Y_n & \text{with probability } 1 - r. \end{cases}$$
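
The steps above can be sketched in Python as follows, assuming NumPy. The target here is a hypothetical one chosen for illustration: an unnormalized bivariate Normal with correlation 0.8. Both proposals are assumed to be symmetric random walks, so the proposal ratios in steps (1b) and (2b) equal 1 and drop out. The function names, step size, and chain length are likewise assumptions.

```python
import numpy as np

def log_f(x, y, rho=0.8):
    """Log of an unnormalized bivariate Normal density (illustrative target)."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))

def metropolis_within_gibbs(log_f, n_iter=20000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # (1a)-(1c): propose Z for X with Y held fixed; accept with probability r.
        z = x + step * rng.normal()
        if np.log(rng.uniform()) < log_f(z, y) - log_f(x, y):
            x = z
        # (2a)-(2c): propose Z for Y using the updated X; accept with probability r.
        z = y + step * rng.normal()
        if np.log(rng.uniform()) < log_f(x, z) - log_f(x, y):
            y = z
        draws[t] = x, y
    return draws

draws = metropolis_within_gibbs(log_f)
sample_corr = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]
```

Working with log densities avoids numerical underflow in the ratio. With a non-symmetric proposal, the log of the proposal ratio would have to be added to each acceptance test.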
24.6 Bibliographic Remarks

MCMC  methods  go  back  to  the  effort  to  build  the  atomic  bomb  in  World  War 
II.  They  were  used  in  various  places  after  that,  especially  in  spatial  statistics. 
There  was  a  new  surge  of  interest  in  the  1990s  that  still  continues.  My  main 
reference  for  this  chapter  was  Robert  and  Casella  (1999).  See  also  Gelman 
et  al.  (1995)  and  Gilks  et  al.  (1998). 


24.7  Exercises 


1. Let

$$I = \int_1^2 \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx.$$


(a) Estimate $I$ using the basic Monte Carlo method. Use $N = 100{,}000$.
Also, find the estimated standard error.

(b) Find an (analytical) expression for the standard error of your
estimate in (a). Compare to the estimated standard error.

(c) Estimate $I$ using importance sampling. Take $g$ to be $N(1.5, v^2)$ with
$v = .1$, $v = 1$ and $v = 10$. Compute the (true) standard errors in each
case. Also, plot a histogram of the values you are averaging to see if
there are any extreme values.

(d) Find the optimal importance sampling function $g^*$. What is the
standard error using $g^*$?


2. Here is a way to use importance sampling to estimate a marginal density.
Let $f_{X,Y}(x, y)$ be a bivariate density and let $(X_1, Y_1), \ldots, (X_N, Y_N) \sim f_{X,Y}$.

(a) Let $w(x)$ be an arbitrary probability density function. Let

$$\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_{X,Y}(x, Y_i)\, w(X_i)}{f_{X,Y}(X_i, Y_i)}.$$

Show that, for each $x$,

$$\hat{f}_X(x) \xrightarrow{P} f_X(x).$$

Find an expression for the variance of this estimator.

(b) Let $Y \sim N(0, 1)$ and $X \mid Y = y \sim N(y, 1 + y^2)$. Use the method in
(a) to estimate $f_X(x)$.
3. Here is a method called accept-reject sampling for drawing
observations from a distribution.

(a) Suppose that $f$ is some probability density function. Let $g$ be any
other density and suppose that $f(x) \le M g(x)$ for all $x$, where $M$ is a
known constant. Consider the following algorithm:

(step 1): Draw $X \sim g$ and $U \sim \text{Unif}(0, 1)$;

(step 2): If $U \le f(X)/(M g(X))$ set $Y = X$, otherwise go back to step
1. (Keep repeating until you finally get an observation.)

Show that the distribution of $Y$ is $f$.

(b) Let $f$ be a standard Normal density and let $g(x) = 1/(\pi(1 + x^2))$ be
the Cauchy density. Apply the method in (a) to draw 1,000 observations
from the Normal distribution. Draw a histogram of the sample to verify
that the sample appears to be Normal.


4. A random variable $Z$ has an inverse Gaussian distribution if it has
density

$$f(z) \propto z^{-3/2} \exp\left\{ -\theta_1 z - \frac{\theta_2}{z} + 2\sqrt{\theta_1 \theta_2} + \log\left(\sqrt{2\theta_2}\right) \right\}, \quad z > 0$$

where $\theta_1 > 0$ and $\theta_2 > 0$ are parameters. It can be shown that

$$E(Z) = \sqrt{\frac{\theta_2}{\theta_1}} \quad \text{and} \quad E\left(\frac{1}{Z}\right) = \sqrt{\frac{\theta_1}{\theta_2}} + \frac{1}{2\theta_2}.$$

(a) Let $\theta_1 = 1.5$ and $\theta_2 = 2$. Draw a sample of size 1,000 using the
independence-Metropolis-Hastings method. Use a Gamma distribution
as the proposal density. To assess the accuracy, compare the mean of $Z$
and $1/Z$ from the sample to the theoretical means. Try different Gamma
distributions to see if you can get an accurate sample.

(b) Draw a sample of size 1,000 using the random-walk-Metropolis-Hastings
method. Since $z > 0$ we cannot just use a Normal density.
One strategy is this. Let $W = \log Z$. Find the density of $W$. Use the
random-walk-Metropolis-Hastings method to get a sample $W_1, \ldots, W_N$
and let $Z_i = e^{W_i}$. Assess the accuracy of the simulation as in part (a).

5. Get the heart disease data from the book web site. Consider a Bayesian
analysis of the logistic regression model

$$P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_j}}{1 + e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_j}}.$$

(a) Use the flat prior $f(\beta_0, \ldots, \beta_k) \propto 1$. Use the Gibbs-Metropolis algorithm
to draw a sample of size 10,000 from the posterior $f(\beta_0, \ldots, \beta_k \mid \text{data})$. Plot
histograms of the posteriors for the $\beta_j$'s. Get the posterior mean and a
95 percent posterior interval for each $\beta_j$.

(b) Compare your analysis to a frequentist approach using maximum
likelihood.
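
For reference, the accept-reject algorithm described in Exercise 3(a) can be sketched in Python as follows, assuming NumPy. To avoid giving away part (b), the sketch uses a different, hypothetical pair: the target $f$ is the Beta(2, 2) density $6x(1 - x)$ on $[0, 1]$, the proposal $g$ is Unif(0, 1), and $M = 1.5$, which satisfies $f(x) \le M g(x)$. All function and argument names are illustrative.

```python
import numpy as np

def accept_reject(f, g_sample, g_pdf, M, size, seed=0):
    """Draw `size` observations from density f using proposals from g,
    assuming f(x) <= M * g(x) for all x."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < size:
        x = g_sample(rng)                 # step 1: X ~ g
        u = rng.uniform()                 #         U ~ Unif(0, 1)
        if u <= f(x) / (M * g_pdf(x)):    # step 2: accept X, else try again
            out.append(x)
    return np.array(out)

def beta22_pdf(x):
    """Beta(2, 2) density on [0, 1] (illustrative target)."""
    return 6.0 * x * (1.0 - x)

sample = accept_reject(
    f=beta22_pdf,
    g_sample=lambda rng: rng.uniform(),   # proposal draw from Unif(0, 1)
    g_pdf=lambda x: 1.0,                  # Unif(0, 1) density
    M=1.5,                                # max of 6x(1-x) on [0, 1]
    size=5000,
)
```

On average a fraction $1/M$ of proposals is accepted, so a smaller valid $M$ means fewer wasted draws.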